INTRODUCTION


ORA-Operations  Research  Associates  has  surveyed  and  reviewed  the 
literature  on  the  design  and  construction  of  questionnaires  as  part  of 
a contract  with  the  Army  Research  Institute  for  the  Behavioral  and 
Social  Sciences,  Fort  Hood,  Texas.  This  report  presents  the  results  of 
that survey and review. It is based on a broad definition of questionnaire
to include scales, structured interview forms, survey forms, and
similar  paper  and  pencil  instruments  used  to  elicit  responses  and  collect 
information. 

The emphasis of this review was on questionnaires used with Army
personnel participating in military field tests concerned with evaluating
training, equipment, organizations, concepts, and doctrine, but little was
found on this topic. However, since considerations affecting questionnaire
construction for Army field test evaluations are common to questionnaire
construction for other uses, this review covers the pertinent
literature from other fields. The review was not concerned with the
evaluation of soldier attitudes or reactions pertaining to societal
problems, personality, academic testing, or similar research areas except as
the related methodological considerations were also applicable to field
test evaluations.

Emphasis was placed on those sources which provided empirical
data on questionnaire construction. Material on the administration and
analysis of questionnaires and on questionnaire application and results was
excluded except where specifically related to questionnaire construction.
Topics not stressed in the literature review are noted as appropriate in
the text.

The literature search was quite comprehensive and included the review
of journals, books, and reports in the fields of psychology, education,
sociology, marketing, and the military. Both hand and computer
searches were made. Computer searches were made of information retrieval
systems maintained by: the American Psychological Association for
Psychological Abstracts covering the years 1967 to 1974; the Educational
Resources Information Center for the years 1957 to 1974; the National
Technical Information Service for 1963 to 1974; the Defense Documentation
Center; and the Bureau of the Census.

Hand searches were made to supplement the computer searches and
included the: Psychological Abstracts for 1949 through 1967; Annual Reviews of
Psychology for 1960 through 1974; Journal of Marketing for 1942 to
1974; Journal of Advertising Research for 1960 to 1974; Journal of
Marketing Research for 1964 to 1974; Business Periodicals Index for 1951
to 1974; and Public Administration Information Service for 1949 to 1974.


Hand searches were also made of several bibliographies: Goheen and
Kavruck (1950) covered the early work for the years 1929 to 1949; Potter,
Sharpe, Hendee, and Clarke (1972) covered more recent work; the ARI Field
Unit at MASSTER, Fort Hood, provided a March, 1974, short bibliography
on  the  subject;  and  the  in-process  bibliography  of  the  Army's  Test  and 
Evaluation  Command  was  also  reviewed.  Finally,  the  articles  abstracted 
were  reviewed  for  references,  as  were  recognized  pertinent  texts  and  staff 
personal  files. 

The  literature  search  yielded  a total  of  over  2,000  citations  on 
questionnaire  construction  and  methodology;  however,  abstracts  were  only 
available  or  prepared  for  about  half  of  the  citations.  This  limitation 
was  imposed  by  the  level  of  effort  available,  and  the  selections  were  made 
on  the  basis  of  the  apparent  relevance  of  each  citation,  judging  primarily 
from  its  title  or  abstract  if  available.  The  actual  writing  of  the 
following  chapters  was  based  on  a selection  from  these  abstracts,  with 
occasional reference to the actual articles, depending on the organizational
needs of the chapter as seen by its author. The articles actually
cited  in  the  writing  are  included  in  the  attached  bibliography  and  are 
identified  by  asterisks. 

The  results  of  this  literature  search  were  used  as  a basis  for  the 
development  of  a manual  on  questionnaire  construction  (Dyer,  Matthews, 
Wright, & Yudowitch, 1975). The manual was prepared for use as a guide
by  personnel  charged  with  the  development  of  questionnaires  for  use  in 
Army  field  test  evaluations.  It  includes  chapters  on  topics  discussed 
in  this  report. 


In the text which follows, Chapters II through XI were selected and
organized to cover comprehensively and with minimal overlap the technical
objectives of the study contract between ORA and ARI. These chapters
also include for completeness some additional parallel items. Chapter II
discusses the advantages and disadvantages of various types of
questionnaires. Chapter III considers the selection of questionnaire items to be
used, including the content of questionnaire items and the pros and cons
of using various types of questionnaire items. Chapter IV notes articles
about various scaling techniques. The effects of variations in the
presentation of questionnaire items are covered in Chapter V, while Chapter
VI reviews articles on the number of response alternatives and response
anchoring. The order of perceived favorableness of commonly used words
and phrases is the topic of Chapter VII. Chapter VIII examines
considerations related to the physical characteristics of questionnaires, while
considerations related to the administration of questionnaires are covered
in Chapter IX. Characteristics of respondents that influence questionnaire
results, including various biases and response sets, are discussed in
Chapter X, while Chapter XI is devoted to considerations related to the
evaluation of questionnaire results. Finally, Chapter XII notes
recommended areas for further research based upon either identified gaps in the
empirical research or contradictions among studies.


Chapter II


ADVANTAGES  AND  DISADVANTAGES  OF  VARIOUS  TYPES  OF  QUESTIONNAIRES 


This chapter discusses, to the extent articles were available on the
topic, some of the advantages and disadvantages of using various types of
questionnaires, as the word "questionnaire" was defined in Chapter I. In
the first section below, methods to measure attributes and behavior are
mentioned. Next, the structured interview is compared first with mail
questionnaires, and then with other types of questionnaires. Comparisons
between open- and closed-ended items are then discussed.

Methods  to  Measure  Attributes  and  Behavior 

There are a number of techniques of data collection that can be
used to measure human attributes and behavior, some of which have been
reviewed by Deri, Dinnerstein, Harding, and Pepitone (1948). The methods
include observation, personal and public records, specific performances,
sociometry, interviews, questionnaires, rating scales, pictorial techniques,
projective techniques, achievement testing, and psychological testing, among
others. For this review, however, attention has been restricted to a more
limited number of data collection techniques: certain paper and pencil types
of instruments broadly classed as questionnaires as defined in Chapter I,
and including only some of the techniques mentioned above. A distinction
has also been made, in the text to follow, between open-ended questionnaire
items and closed-ended items. Open-ended items are those which permit the
respondent to express his opinions in his own words and to indicate any
qualifications he wishes. Closed-ended items, on the other hand, utilize
response alternatives, such as multiple choice or true-false. Structured
interviews are included within the definition of questionnaire used since
typically an interview schedule is developed and employed by an interviewer
both for asking questions and recording responses, much like a
self-administered questionnaire with open-ended items. This distinction is not as
clear as it might be, however, since some investigators (such as Paradise
and Blankenship, 1951) admit of orally administered questionnaires,
structured interviews, and unstructured interviews. In any case,
unstructured interviews are outside the scope of this review, and they will not
be discussed further.


Comparison  of  the  Structured  Interview  and  Mail  Questionnaires 

During  the  literature  review,  attention  was  given  to  articles  on  the 
use  of  mail  questionnaires  only  to  the  extent  that  the  information  might 
be generalizable to other types of questionnaires. Accordingly, any
articles  related  to  sampling  considerations,  correcting  variance  estimates 
for  non-response,  etc.,  were  ignored.  Since  the  use  of  mail  questionnaires 
involves  the  consideration  of  issues  that  do  not  pertain  to  the  use  of 
other  types  of  questionnaires,  they  are  discussed  separately. 

A number of criteria were employed by O'Dell (1962), who compared
personal interviews and mail questionnaires using identical terms. He
found an interview bias, in that during the interviews the usage of
certain types of products was understated when it might reflect unfavorably
on the respondent. Wiseman (1972), who compared a mailed questionnaire,
telephone interview, and personal interview, concluded that issues involving
socially accepted or rejected answers will produce more bias in interviews
than in questionnaires. Ellis (1948) similarly found more self-revelatory
or unfavorable responses in anonymous mailed questionnaires than in
interviews. Ford (1969) asked identical questions in a mail questionnaire
followed by an interview. He found that there was a consistency of response
about newspaper readership and about socioeconomic factors, but inconsistency
on items related to attitudes and opinions, the location of past purchases,
and when past purchases were made. A number of factors may have influenced
his results, however, such as the time lapse between the questionnaire
completion and the interview. Williams (1968) noted that data gathered
by telephone interview may be less accurate than those obtained from a
mail questionnaire, since the group who are at home to answer the
telephone may not be as representative as those to whom the questionnaires
are mailed.

The comparative costs of interviews and mailed questionnaires were
discussed in five articles. Cahalan (1951) administered a 23-page mailed
questionnaire to 1,051 Army officers, and found it was less expensive,
more anonymous, and faster than the interview technique. O'Dell (1962)
reported that the costs of interviewer time tended to outweigh the costs
of obtaining and maintaining a mail panel. Gibson and Hawkins (1968)
concluded that, under the promise of anonymity, the questionnaire should
equal the interview in response information at a much smaller expense
(although there is some question about the survey design they employed).
The degree of consistency between interview and questionnaire results found
by Parker, Wright, and Clark (1957) also raised questions concerning the
justification of the expense of interviewing, when questionnaires or
similar techniques would be only slightly less reliable. Sudman, Greeley,
and Pinto (1965) were somewhat more conservative in their conclusion,
reporting that costs were not significantly affected, regardless of whether
interviews, mail questionnaires, or a combination of both were employed.

Specificity  of  responses  was  discussed  only  by  O'Dell  (1962).  He 
noted  that  noncommittal  responses  and  the  tendency  not  to  answer  open- 
ended  questions  were  more  prevalent  for  mail  questionnaires  than  for 
personal  interviews,  as  might  be  expected. 

Combinations of survey methods were discussed in three articles.
Sudman, Greeley, and Pinto (1965) found that self-administered
questionnaires used in conjunction with personal interviews elicited a slightly
higher cooperation/return rate from respondents than either used alone.
The result that comparisons between interviews, self-administered
questionnaires, and a combination of both did not indicate any large differences
suggested to them that additional flexibility should be considered in
methods of survey research. Payne (1964) also suggested that sometimes
a combination of survey methods, such as personal interview, telephone
interview, and mail questionnaires, may be used with the same respondents
to produce results more efficiently than one method alone could do.
However, he presented no firm evidence of higher reliability or validity
for combined methods over individual survey methods. Sharp (1955) found
that when respondents were unable to give complete information during an
interview and copies of a questionnaire were left to be mailed back,
40% were returned, thus eliminating the necessity for call-backs.


Comparison  of  the  Structured  Interview  and  Other  Questionnaires 

Most of the studies comparing the structured interview with
questionnaires other than mail questionnaires did so in terms of the consistency
of response from one technique to the other. For example, Bennett, Alpert,
and Goldstein (1954), though working with only 16 subjects, found that
26 out of 30 questions showed significant consistency of response from
a one hour interview immediately followed by the use of a limited response
questionnaire on the same topic. Consistency coefficients reported were
1.00 (perfect) for sociological information, .78 on knowledge, .69 on
past behavior, and .46 on attitudes. The conclusion reached was that on
information other than sociological, differences in response will be noted
between interview and limited choice questionnaires, especially concerning
attitudes.

The  results  obtained  by  Bennett,  Alpert,  and  Goldstein  (1954)  appear 
to  have  been  supported  in  part  by  two  other  investigations.  Walsh  (1967) 
compared  the  accuracy  of  the  interview,  questionnaire,  and  personal  data 
blank  for  collecting  verifiable  biographic  information.  Comparing  collected 
data  to  available  records  for  270  students,  he  found  no  differences,  and 
concluded that biographic data may be collected reliably by the most
efficient means. Boulger (1970) also found that the validity of response to
interviews  and  questionnaires  was  not  significantly  different  in  the 
elicitation  of  life  history  data. 

Three studies compared structured interviews and other questionnaires
in the measurement of attitudes. In the first, Metzner and Mann (1952)
followed a fixed alternative questionnaire administered to 328 employees
with an open-ended interview. They noted a tendency for the employees to
rate slightly higher in the interview than on the questionnaire. There
were, however, a number of limitations to the study, including a two month
time lapse between completion of the questionnaire and the interview. In
the second study, Wedell and Smith (1951) found that interviewers
overestimated attitude in comparison with self-judged attitude, although the objective
rating of interview record sheets was closer to self-rating than the
interviewers' rating. Wheatley (1973), however, found no significant differences
between mean scale scores for two groups, one of which expressed their
attitudes during a telephone interview, while the other group responded
on a self-administered questionnaire.


Although studies involving the use of questionnaires for the
measurement of personality were generally excluded from the literature review,
three were considered in that they compared results obtained from interviews
and questionnaires. Eysenck and Eysenck (1962) sought to answer the question
of whether an interview-questionnaire would reveal a factorial structure
essentially identical to that found with questionnaires administered in the
orthodox manner. The results indicated that the method of administration
did not affect the factorial composition of the items, which measured
extraversion and neuroticism. Ambler, Blair, deRivera, Nelson, and
Schoenberger (1958) also found that the interview and questionnaire methods gave
similar results in the classification of subjects according to three levels
of anxiety towards flying. The conclusions reached by Levonian (1963),
however, were different. He determined the reliability of three short
personality scales administered by the interview survey method to 432
subjects. The values were sufficiently less than the consistency
reliabilities of short scale personality measures obtained by the usual
questionnaire survey method to raise serious questions about the adequacy
of such personality measures obtained by the interview method.

A comparison of interview and other questionnaire results when
ego-involving questions were asked was the topic of two reports. Knudsen,
Pope, and Irish (1967) concluded that interviews may lessen the expression
of deviance, compared with anonymous questionnaires. Based on three
different samples of white women all of whom were or had been premaritally
pregnant for the first time, the data suggested that in interview situations
respondents were more likely to support the public and restrictive sexual
norms that they assumed were adhered to by the interviewer. In the private
and anonymous questionnaire situation, the respondents more often answered
to subcultural norms. Ellis (1947b) compared the questionnaire and interview
methods in the study of human love relationships. His results indicated
that the great majority of subjects gave less favorable, or more
incriminating, responses to the questionnaires than they did to the interview.
Ellis concluded that for more ego-involving questions the questionnaire
may produce more self-revelatory data than the interview.


Comparison of Open- and Closed-Ended Items

Of  the  five  articles  that  compared  the  use  of  open-ended  and  closed- 
ended  questionnaire  items,  three  appeared  to  favor  the  use  of  the  open- 
ended  format,  at  least  for  the  factors  considered.  Ellenbogen  and  Danley 
(1962),  in  a study  of  the  comparability  of  responses  to  a socially 
concordant  question,  found  that  responses  were  more  varied  to  the  open- 
ended  question  than  to  the  closed,  although  the  closed  had  an  "other" 
category. Asking about sources of helpful health advice, they also
found  that  19%  of  the  responses  were  inconsistent,  in  that  sources  of  advice 
cited  in  the  open  question  were  omitted  in  the  closed. 

England (1948) compared open-ended and dichotomous items about
capital punishment in three survey samples of 2,000, 3,000, and 6,000.
The results gave preference to the open-ended items, since they allowed for
the expression of middle party opinions that the dichotomous items forbade.
However, in coding the open-ended items, expert analysts were required to
obtain reliable results.

The results of a computer-assisted method of free response (after
which the respondents evaluated the responses they generated on a rating
scale) were compared with responses to prelisted statements in a study by
Kohan, deMille, and Myers (1972). Although no significance tests were
reported,  the  free  response  method  appeared  to  generate  response  categories 
that  differed  rather  substantially  from  the  prelisted  statements.  Issues 
of  importance  that  were  overlooked  by  the  questionnaire  developers  were 
identified.  It  was  concluded  that  reliance  on  the  conventional  method 
may  distort  a study's  focus  by  obtaining  data  on  items  not  of  real  concern 
and  having  no  accurate  means  to  measure  concern.  The  authors  also 
noted  that  high  affirmative  levels  for  an  item  can  often  be  interpreted 
as  a response  set  or  lip  service,  while  responses  generated  by  unstructured 
methods  are  probably  more  reflective  of  personal  involvement  or  concern. 

The study favoring the use of closed-ended items was by Scates and
Yoemans (1950a). It was undertaken by the American Council on Education
to determine the value of objective tests for identifying those scientists
and engineers who were likely to undertake further education. It was
concluded that the use of objective tests was more advantageous than the
several depth essay questions used in a previous study, because they took
less time and were therefore more acceptable to the employees.

The  best  summary  for  this  section  was  stated  by  Prien,  Otis,  Campbell  & 
Saleh  (1964).  They  noted  that  the  open-ended  type  of  questionnaire  has 
the  advantage  of  providing  unique  information,  whereas  the  objective  type 
of  questionnaire  is  generally  more  reliable.  The  combination  of  both,  they 
said,  would  appear  to  be  best. 


Conclusions


The decision about which type of questionnaire to use depends upon
the specific research question that one is attempting to answer and the
practical limitations involved. Both structured interviews and other types
of questionnaires appear to have their place in research studies, and both
have their limitations. The choice of which to use may well depend
upon costs, which are generally lower for the typical questionnaire. The
typical questionnaire is apparently more reliable, while the structured
interview may provide more unique information. If the dimensions of a
problem have not been explored before, the best compromise would appear to
be to use the interview approach with open-ended items to uncover the
dimensions, and follow this by the use of the more reliable paper and pencil
questionnaire to obtain more specific information.


Chapter  III 


SELECTION  OF  QUESTIONNAIRE  ITEMS  TO  BE  USED 

Once  a decision  has  been  made  as  to  the  type  of  questionnaire 
instrument  to  use  (the  topic  of  Chapter  II),  the  specific  questionnaire 
items  to  be  administered  need  to  be  selected.  The  two  main  sections  in 
Chapter  III,  then,  address  the  content  of  questionnaire  items  and  the  pros 
and  cons  of  various  types  of  questionnaire  items. 


Content  of  Questionnaire  Items 

This  section  considers  first  methods  for  determining  questionnaire 
content,  and  then  other  issues  related  to  questionnaire  content. 

Methods  for  Determining  Questionnaire  Content 


There  are  a number  of  ways  that  can  be  used  to  determine  questionnaire 
content.  One  of  these  that  is  not  too  well  known  is  the  critical  incident 
technique.  As  noted  by  Flanagan  (1954)  the  critical  incident  technique 
consists  of  a set  of  procedures  for  collecting  direct  observations  of 
human  behavior  in  such  a way  as  to  facilitate  their  potential  usefulness 
in  both  solving  practical  problems  and  in  developing  broad  psychological 
principles.  The  technique  outlines  procedures  for  collecting  observed 
incidents  of  behavior  having  special  significance  and  meeting  systematically 
defined criteria. It can be of assistance, therefore, in helping to
determine the content of items to be included in questionnaires. Although many
articles on the technique have been published, they were not all reviewed in
conjunction with preparing this review. One article on the topic was
prepared by Barnes (1960), who gave an historical sketch of the
development of the technique, and outlined the procedures to follow in using this
approach for social research. The procedures, representing one way that
the critical incident technique can be used, included: determining the
aims of the investigation; securing competent reporters or observers;
collecting the critical incidents of behavior actually observed; selecting
those incidents to be included in the final study; analyzing and classifying
the data; and interpreting the findings.

Another  method  for  selecting  items  for  an  attitude  scale  was  used  by 
Alilunas  (1949),  who  was  concerned  not  only  with  finding  out  what  people 
think  about  an  issue,  but  how  they  think  about  matters  on  which  they  are 
asked  to  give  an  opinion.  The  method  starts  with  asking  a group  of 
individuals to write six statements giving their impressions of a topic,
such  as  capitalism.  From  these,  some  smaller  number  of  statements  are 
selected  that  are  readable,  intelligible,  and  capable  of  classification. 
These  statements  can  then  be  sorted  into  several  categories,  such  as  the 
status  of  the  topic  and  its  good  and  bad  features. 

Yet another way of developing closed-ended questionnaire items is to
evaluate the responses to corresponding open-ended items, as suggested by
authors such as Payne (1965). Reporting on a computer-assisted method of
free response analysis where respondents give answers they think appropriate
and then rate their answers according to dimensions specified on a scale,
Kohan, deMille, and Myers (1972) stated that the method identified issues
of importance that had been overlooked by questionnaire developers either
because of their own biases or imperfect knowledge. They also noted that
reliance on the conventional method may distort a study's focus by obtaining
data on items not of real concern and having no accurate means to measure
concern. Also on the topic of bias, Schuessler (1952) questioned the
randomness of item selection in scale analysis. He showed data to indicate
that differences among investigators' definitions of the universe and bias
in selecting items both affect their results. He concluded that much
more critical attention is needed by researchers to avoid their own
biases and influence in gathering data for analyses.

Hart,  Faust,  Rowland,  and  Lucier  (1964),  in  a report  on  attitudes  of 
troops  in  the  tropics,  noted  that  the  sentence  completion  technique  is 
useful  for  assessing  topic  and  dimension  saliency,  and  for  validating  the 
objective  techniques.  They  also  reported  that  a listing  technique  is 
valuable  for  identifying  salient  topic  dimensions  and  salient  topics,  and  for 
updating  instruments  which  are  developed  on  pilot  samples  and  used  on 
larger populations. They felt that considerable effort should be exerted
to identify the salient topical dimensions, their levels, and their
interrelationships whenever an objective scaling technique is used.


Other  Considerations  Related  to  Questionnaire  Content 

This section discusses a number of diverse topics, all of which are
related  in  some  way  to  questionnaire  content. 

Five obstacles to the selection of appropriate questions to test
social-psychological variables were discussed by Bradburn (1970). They are:

1. Lack of agreement among behavioral scientists about the appropriate
social-psychological dependent variables that are relevant to particular
social programs;

2. An inadequate conceptualization of those social-psychological
variables that are suggested for study;

3. A relative lack of interest in systematic methodological research
and survey measurement;

4. The relative underdevelopment of measurement theory in survey work;
and

5. The special historical and cultural problems that affect the
phraseology of questions.


Among the principles reported by Blankenship (1942) that he believes
should be followed in the wording of preference-type questions, those
relating to questionnaire content are: to be psychologically sound, a
question should ask about past, present, or future behavior, rather than
hypothetical opinion; the questions should not damage the pride of
respondents; and the first few questions used must secure rapport with the
respondent.

The  fact  that  questionnaire  items  can  produce  variable  distortion  was 
pointed  out  in  the  report  of  a study  by  Klein,  Maher,  and  Dunnington  (1967). 
Items  dealing  with  salary  and  with  ratings  of  top  management  produced 
consistent  positive  distortions,  whereas  items  dealing  with  work  pressures 
and  the  respondent's  manager  produced  little  or  no  distortion  even  under 
conditions  of  high  threat.  Dunnette  and  Heneman  (1956)  also  noted  that 
a threat  to  anonymity  results  in  differential  amounts  of  response  distortion, 
depending  upon  the  content  of  different  items  comprising  the  questionnaire. 

Marquis, Marshall, and Oskamp (undated) reported on a study of the
accuracy and completeness of testimony as a function of kind of questions.
They found that for items of low salience, structured questioning resulted
in more complete but less accurate responses. However, for items of high
salience, more structured questioning did not reduce accuracy. Similarly,
Miklich (1966) found that, if an ambiguous item was important, the tendency
was to agree with it. If it was unimportant, the tendency was to disagree.

Two  studies  considered  the  reliability  of  various  types  of  questionnaire 
items  as  a function  of  content.  Cavan  (1933)  concluded  that  questions  involving 
attitudes  or  estimates  have  lower  reliability  than  factual  questions,  and  that 
reliability  is  increased  by  avoiding  fine  detail.  Guber  and  Gerberich  (1946), 
on  the  other  hand,  found  that  factual  questions  showed  the  least  reliability. 

Finally,  Spector  (1957)  demonstrated  that  the  test  user's  values  and 
needs  do,  and  should,  enter  into  judgments  made  during  the  construction  and 
validation  of  an  attitude  test. 


Pros  and  Cons  of  Various  Types  of  Questionnaire  Items 

This section presents the pros and cons of various types of questionnaire
items as obtained from the literature reviewed. Included are: ranking
items;  rating  scale  items;  multiple  choice  items;  forced  choice  and  paired 
comparison  items;  card  sorts;  semantic  differential  items;  and  other  types 
of  items.  As  appropriate,  comparisons  of  item  types  are  included,  except 
for  a comparison  between  open-ended  and  closed-ended  items  generally, 
which was discussed in Chapter II.


Ranking  Items 


Comparison of ranking and rating scales. Five articles were abstracted
that compared ranking and rating methods. Bittner and Rundquist (1950)
described the rank comparison rating method, and noted that comparisons
with other studies revealed that the method gives results closely related
to rank comparison. Murphy, Bailey, and Covell (1954) found, in judging
frozen strawberries, that rating provided better discrimination than ranking
when ten judges were used. However, Rennick, Grupe, Reich, and Sewell (1954)
found rankings to be more reliable than ratings when professional staff both
ranked and rated parents' descriptive reports of their children's growth
in specific character attitudes. Bartlett, Heermann, and Rettig (1960)
found that, for a single judge, the ranking and paired comparison techniques
were superior in reliability to the Likert, graphic rating, and equal
appearing intervals techniques. Kassarjian and Nakanishi (1967), however,
found comparability between ranking and Likert-type scaling based on
reliabilities and inter-method correlations when methods were compared for
the selection of a brand name for a fictitious new phonograph.

Comparison  of  ranking  and  paired  comparisons.  There  appears  to  have 
been  contradictory  evidence  obtained  when  the  ranking  and  paired  comparisons 
methods were compared. Wilkins (1950), using 300 men randomly selected
from British army reception centers, found that the two methods did not
yield similar results when the importance of eight characteristics of
jobs was considered. The observed differences did not appear to be
systematic or biased, although characteristics of least importance varied the
most between methods. Witryol and Thompson (1953) found that a paired comparison
questionnaire  was  a more  stable  measure  than  a partial  rank  order  form  of 
a social  acceptance  questionnaire  administered  to  about  80  sixth  grade 
students.  They  noted  that  this  may  be  due  to  the  larger  number  of  responses 
required  in  the  paired  comparison  form.  They  also  said  that  this  form  is  a 
more  sensitive  measure  of  the  status  of  individuals  in  the  middle  range 
of  the  acceptability  continuum,  and  offers  relatively  more  general 
measures  of  social  status.  The  partial  rank  order  scales  may  reflect 
more  personal  and  situational  factors.  Also  in  favor  of  paired  comparisons, 
Cohen  (1967)  suggested  that  the  ranking  of  stimuli  produces  a statistical 
artifact  that  can  be  corrected  and  controlled  by  paired  comparison 
analysis. The artifact is the inability of ranking to detect the comparative
position of each stimulus in relation to each other stimulus. Fenner,
Homant, and Rokeach (1968) compared the rank order and paired comparison
methods of measuring terminal and instrumental values. For the terminal
values, the paired comparison reliability was significantly higher, while
for instrumental values the difference was not significant, the trend being
in  the  opposite  direction.  The  authors  concluded,  however,  that  the  benefit 
of  the  paired  comparison  method  as  compared  with  the  rank  order  method  is 
doubtful.  The  results  suggested  that  the  paired  comparison  method  should 
be  employed  when  measuring  value  systems  only  if  there  is  a principal  concern 
with  the  terminal  values  and  if  the  time  and  effort  expended  in  testing, 
scoring,  coding,  etc.,  are  not  important  considerations.  Bernard  (1933) 
had  come  to  a similar  conclusion,  noting  that  the  method  of  ranking  is  not 
inferior  in  reliability  to  that  of  paired  comparisons.  He  also  stated  that, 
since  it  took  twice  as  long  for  the  judges  to  complete  the  paired  comparisons 
as  the  ranking,  the  latter  was  the  superior  method. 

Three other studies found essentially no differences between the
methods of ranking and paired comparisons. Eng and
French  (1948),  comparing  sociometric  and  psychological  methods  of  scaling, 
found  a near  perfect  correlation  between  mean  ranks  and  paired  comparisons. 
Kassarjian  and  Nakanishi  (1967),  in  the  study  noted  in  the  previous  section, 
also  found  comparability  between  ranking,  paired  comparisons,  and  open 
choice  preferences.  Slater  (1965)  also  found,  from  four  experiments, 
comparability  of  ranking,  paired  comparisons,  and  other  forced  choice 
comparisons  for  recording  personal  preferences.  He  concluded  that  the 
whole  weight  of  the  evidence  is  in  favor  of  the  view  that  an  informant, 
when expressing his personal preferences, tends to maintain a level of
reliability  which  characterizes  him  as  an  individual,  and  is  unaffected 
either  by  variations  in  the  number  of  objects  he  is  given  to  compare  or 
changes  in  the  methods  he  is  asked  to  use. 

The  relationship  between  ranking  and  the  method  of  paired  comparisons 
was  reported  by  Ross  (1955)  and  Pauli  (1968).  Ross  showed  that,  when  N 
judges  are  asked  to  indicate  their  preferences  for  n items  by  both  the 
method  of  paired  comparisons  and  the  method  of  rank  order,  a linear 
relationship  holds  between  the  total  number  of  choices  from  the  paired 
comparison  method  and  the  mean  rank  from  the  rank  order  method.  Pauli 
(1968)  studied  the  reliability  of  results  obtained  by  the  psychophysical 
methods  of  rank  ordering  and  paired  comparisons  when  subjects  are  ego 
involved  in  the  material  being  judged.  He  found  that  scales  derived  by 
the  two  methods  are  linear  in  relationship. 
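
The linear relationship reported by Ross can be illustrated with a small
sketch. It assumes, as a simplification not stated in the review, that each
judge's paired comparison choices are fully consistent with that judge's rank
order. Under that assumption an item ranked r among n items is chosen n - r
times by that judge, so the total number of choices over N judges equals
N(n - mean rank). The function names and the sample rankings are hypothetical.

```python
from itertools import combinations


def paired_choice_totals(rankings):
    """Count how often each item is chosen over all pairs and judges,
    assuming every judge picks the item he ranked higher (rank 1 = best)."""
    items = list(rankings[0])
    totals = {item: 0 for item in items}
    for ranks in rankings:
        for a, b in combinations(items, 2):
            winner = a if ranks[a] < ranks[b] else b
            totals[winner] += 1
    return totals


def mean_ranks(rankings):
    """Mean rank of each item across judges."""
    items = list(rankings[0])
    return {i: sum(r[i] for r in rankings) / len(rankings) for i in items}


# Hypothetical data: N = 3 judges ranking n = 4 items.
rankings = [
    {"A": 1, "B": 2, "C": 3, "D": 4},
    {"A": 2, "B": 1, "C": 4, "D": 3},
    {"A": 1, "B": 3, "C": 2, "D": 4},
]
totals = paired_choice_totals(rankings)
means = mean_ranks(rankings)
N, n = len(rankings), len(rankings[0])
for item in totals:
    # Observed total of choices next to the value N * (n - mean rank).
    print(item, totals[item], N * (n - means[item]))
```

With real judges, inconsistent (intransitive) choices would make the two
columns diverge, so the sketch shows only the idealized case.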


Rating  Scale  Items 

Comparison  of  rating  scale  and  multiple  choice  items.  Only  one  study 
comparing rating scale and multiple choice items is reported here, since a
majority  of  such  studies  involve  issues  regarding  the  number  of  response 
alternatives  to  employ  and  are  hence  discussed  below  in  the  first  section 
of  Chapter  VI.  Greenwald  and  O'Connell  (1970)  conducted  a study  to  test 
previous  findings  that  suggested  that  dichotomous  measures  yielded  similar 
but  not  equivalent  information  to  that  of  Likert  scales.  The  results  showed 
that,  as  in  previous  studies,  the  true-false  and  Likert  methods  correlated 
significantly.  However,  the  Likert  format  produced  the  higher  item-total 
correlations.  The  greater  internal  consistency  for  the  Likert  approach 
suggested  a possible  advantage  for  Likert  scaling. 
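
As an illustration of the internal consistency measure mentioned above, the
following sketch computes item-total correlations for a small set of
hypothetical five point Likert responses. It uses the corrected form, in which
each item is correlated with the sum of the remaining items; whether Greenwald
and O'Connell used the corrected or uncorrected form is not stated in the
review, so that choice is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """responses: one list of item scores per respondent.
    Each item is correlated with the total of the other items."""
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [r[j] for r in responses]
        rest = [sum(r) - r[j] for r in responses]
        result.append(pearson(item, rest))
    return result


# Hypothetical data: 5 respondents, 3 Likert items scored 1 to 5.
likert = [[5, 4, 5], [4, 4, 3], [2, 1, 2], [1, 2, 1], [3, 3, 4]]
correlations = item_total_correlations(likert)
print([round(c, 2) for c in correlations])
```

The same function can be applied to true-false items scored 0 and 1, which
allows the two formats to be compared on the same footing.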

Comparison  of  rating  scale  items  and  forced  choice  or  paired  comparison 
items.  In  this  chapter,  forced  choice  and  paired  comparison  items  are 
discussed  together  since  the  latter  is  but  a special  case  of  the  former, 
using  duads. 

A study by Pilgrim and Wood (1955) compared the sensitivity of rating
scale and paired comparison methods for measuring consumer food preferences
under  laboratory  conditions.  The  methods  were  found  to  be  equally  sensitive 
whether  the  differences  in  preference  were  large  or  small.  Similarly, 
Greenberg  (1963)  found  no  significant  differences  between  rating  scale  and 
paired  comparison  tests  used  in  consumer  product  tests. 

In the attitude measurement area, Neidt and Merrill (1951) compared
five point rating scales and paired (positive and negative) statements.
They found that the two showed about equal validity coefficients. Although
the  reliability  of  the  rating  scale  was  somewhat  higher  than  that  of  the 
paired  comparison  form,  the  authors  felt  that  there  are  advantages  to 
the  latter  which  warrant  its  consideration  under  some  circumstances. 

Horst and Wright (1959) also obtained higher reliability for a self-appraisal
personality rating scale than for a paired comparison inventory composed of
the same items, although the rating scale scores were arithmetically
ipsatized. (See Chapter XI for a discussion on the properties and uses of
ipsative scores.) The rating scale also required only about one-third as
much time to administer as the paired comparison test.
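
For readers unfamiliar with the term, the sketch below shows one common way of
ipsatizing scores: each respondent's own mean is subtracted from his scale
scores, so that every respondent's scores sum to zero. The exact arithmetic
used by Horst and Wright is not described in this review, so this procedure
and the sample data are assumptions for illustration only.

```python
def ipsatize(scores):
    """Subtract each respondent's own mean from that respondent's scores."""
    result = []
    for row in scores:
        mean = sum(row) / len(row)
        result.append([s - mean for s in row])
    return result


# Hypothetical data: 3 respondents, 3 scale scores each.
raw = [[4, 2, 3], [5, 5, 2], [1, 3, 2]]
ipsative = ipsatize(raw)
print(ipsative[0])  # first respondent, mean 3 removed
```

After this transformation the scores describe only the relative standing of
scales within a person, which is the same kind of information a paired
comparison inventory yields.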

A personality  questionnaire  and  forced  choice  personality  test  were 
compared  by  Gordon  (1951).  Both  had  the  same  factor  structure  and  much 
the  same  item  content,  and  were  constructed  by  the  method  of  internal 
consistency.  For  all  four  personality  scales  the  forced  choice  method  was 
found  to  be  more  valid  than  the  questionnaire  method,  using  descriptive 
nominations  by  associates  as  the  criterion.  Multiple  correlations  indicated 
that  the  questionnaire  data  added  nothing  towards  the  prediction  of  the 
criteria  when  placed  in  a battery  with  the  forced  choice  test. 

Scott  (1968)  did  a study  of  the  comparative  validities  of  self-report 
forced  choice  and  single  stimulus  tests.  He  noted  that  the  generalization 
that  forced  choice  personality  inventories  are  more  valid  than  single 
stimulus  forms  of  the  same  tests  was  not  supported  by  initial  examination 
of  the  relevant  evidence.  Apparently  only  one  study  that  claimed  superior 
validity  for  the  forced  choice  format  appeared  to  have  used  identical 
items  in  the  two  forms.  Other  studies  either  did  not  use  single  stimulus 
forms  for  comparison,  did  not  hold  item  content  constant  between  the  two 
forms,  or  else  yielded  nonconfirming  results.  He  reported  also  that  the 
most  tenable  conclusion  is  that  test  validity  does  not  depend  on  this 
characteristic  of  item  format  under  the  circumstances  in  which  these  self- 
report  inventories  are  typically  administered. 

Newhall  (1954)  compared  the  methods  of  paired  comparison  and  single 
stimuli  in  the  evaluation  of  a series  of  color  prints  and  color  transparencies. 
The  two  methods  produced  highly  correlated  results.  However,  the  method  of 
single  stimuli  was  preferred  as  being  the  more  efficient  method  in  making 
judgments  where  items  do  not  require  juxtaposition. 

All of the studies to be reviewed in the rest of this section involved
the use of judges or raters, and hence may not be comparable to studies
based upon self-report. Using three groups of judges, the paired comparison
and equal appearing intervals methods of scaling attitude statements toward
teaching were compared empirically by Crawford (1965). He said that the two methods
appeared  comparable,  at  least  when  expert  judges  were  used.  The  use  of 
students  as  judges  in  this  type  of  study  was  questioned.  Students  rated 
occupational  status  in  the  study  by  Bartlett,  Heermann,  and  Rettig  (1960), 
where  little  difference  in  scale  values  or  reliability  for  mean  scale 
values  was  found  using  the  paired  comparisons,  Likert,  graphic  rating, 
and  equal  appearing  intervals  techniques.  The  paired  comparison  and 
ranking  techniques  were,  however,  found  to  be  superior  in  reliability  for 
a single  judge.  Using  85  subjects  to  judge  the  esthetic  value  of  seven 
handwriting  specimens,  Ekman  and  Kunnapas  (1960)  constructed  an  interval 
scale  by  the  method  of  paired  comparisons  and  a ratio  scale  by  a variant 
of  the  method  of  ratio  estimation.  They  both  gave  essentially  the  same 
results. 

A graphic rating scale was compared with six kinds of forced choice
forms for rating Air Force technical instructors, by Berkshire and Highland
(1953). Scores from the graphic rating scale exhibited relatively little
bias and had as high validity as the best of the forced choice scales.
Combining the scores from the graphic and forced choice scales yielded
validity coefficients substantially higher than for either alone. The use
of forced choice items and both eight and five-step graphic scales was
compared in a study conducted by the U.S. Department of the Army (1952) using
400 officers as a rater-ratee population. The eight-step graphic scale had
the highest validity (.53). A study by Staugas and McQuitty (1950),
however, found the forced choice method superior to the use of a graphic
scale. But Bayroff, Haggerty, and Rundquist (1954) found that two types
of graphic rating scales and two modifications of the forced choice
technique did not differ markedly in validity.

Susceptibility  to  errors  was  the  concern  of  two  other  investigations. 
Leftwich  and  Remmers  (1962)  compared  graphic  and  forced  choice  (tetrad) 
ratings  of  teacher  performance.  Distributions  and  intercorrelations  of 
mean  item  and  mean  total  scores  showed  the  graphic  form  relatively  more 
susceptible  to  errors  of  leniency  and  halo.  Item  intercorrelations  were 
also  higher  in  general  for  the  graphic  form.  The  authors  noted,  in  addition, 
that  the  forced  choice  form  was  susceptible  to  fakability,  relative  to  the 
transparency  of  any  forced  choice  tetrad.  Bartlett  and  Sharon  (1969) 
determined  the  effects  of  several  instructional  rating  conditions  on  leniency 
on  a graphic  and  forced  choice  rating  scale.  A significant  leniency  effect 
was  found  with  the  graphic  ratings  which  were  to  be  used  for  evaluation 
purposes  and  those  which  had  to  be  justified  to  the  ratee,  but  presumably 
not  with  those  that  were  anonymous  or  were  identified  by  having  the  rater 
place  his  name  on  the  form.  It  was  concluded  that  the  forced  choice  scale 
was  quite  resistant  to  leniency  bias. 


Comparison  of  rating  scale  items  and  card  sorts.  Two  studies  compared 
the  use  of  rating  scales  and  card  sorts,  both  in  terms  of  determining  scale 
values.  Seashore  and  Hevner  (1933)  substituted  a nine  point  scale  for  each 
item  for  the  standard  method  of  sorting  items  into  nine  piles  from  separately 
printed  slips.  The  rating  method  saved  87%  of  the  time  in  assembling 
materials and 50% of the time in tabulating results involved in making
attitude  scales  by  Thurstone's  method  of  equal  appearing  intervals.  The 
subjects  found  the  task  easier  and  more  pleasant,  and  the  results  showed 
negligible  differences  in  the  medians  or  scale  values  of  the  items,  and 
in  the  difference  or  spread  of  opinion  (Q  value)  in  regard  to  them. 

An investigation of the stability of median and Q values computed from
graphically derived and from sorted judgments used in scaling by the method
of equal appearing intervals was conducted by Siegel and Siegel (1962).
Graphic judgments using a nine point scale tended to yield higher Q values
than nine pile sorts for relatively unambiguous items. The medians derived
from the two procedures correlated .97.
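
The two statistics compared in these studies can be sketched as follows. In
the method of equal appearing intervals, the scale value of a statement is
conventionally the median of the judges' placements, and Q is taken here as
the interquartile range of those placements. The quartile interpolation rule
(the "inclusive" method of Python's statistics module) and the sample
judgments are assumptions; the original studies may have computed quartiles
differently.

```python
from statistics import median, quantiles


def scale_value_and_q(judgments):
    """Return (median, Q) for one statement's placements on a 1-9 scale.
    Q is computed as the third quartile minus the first quartile."""
    q1, _, q3 = quantiles(judgments, n=4, method="inclusive")
    return median(judgments), q3 - q1


# Hypothetical placements of one statement by nine judges.
placements = [3, 4, 4, 5, 5, 5, 6, 6, 7]
scale_value, q = scale_value_and_q(placements)
print(scale_value, q)
```

A small Q indicates that judges agree on where the statement belongs; a large
Q marks the statement as ambiguous.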

Comparison  of  rating  scale  and  semantic  differential  items.  In  the 
study  by  Hart,  Faust,  Rowland,  and  Lucier  (1964)  on  the  attitudes  of  troops 
in the tropics, it was concluded that Osgood's semantic differential
technique was clearly superior to Likert's agree/disagree method of summated
ratings.  They  went  on  to  note  that,  for  most  purposes,  attitudinal  data 
collection  efforts  in  which  objective  questionnaires  are  used  should  consist 
of  some  form  of  the  semantic  differential  scaling  technique  as  opposed  to 
agree/disagree  versions  of  Likert's  method  of  summated  ratings. 

Hughes (1967) compared the use of Thurstone and modified semantic
differential scales (with a "no information" category) in a questionnaire.
None of the Thurstone scales detected attitude change, but 28% of the
semantic differential scales did. Test-retest reliability was .53 for Thurstone,
.58 for the semantic differential. In addition, the semantic differential
increased in preference as the respondents became used to it.

Ward (1969), questioning the results of a previous study, found that
the  semantic  differential  is  no  more  vulnerable  to  changes  in  issue  saliency 
than  are  other  widely  used  measures  of  attitude. 

Comparison  of  rating  scale  and  check  list  items.  The  study  by  Hughes 
(1967) referred to immediately above also included adaptations of a check
list (e.g., with important, unimportant, and no opinion categories) on the
questionnaire  employed.  Eleven  percent  of  the  check  list  scales  detected 
attitude  change,  and  the  check  list  items  had  a test-retest  reliability  of 
.58,  the  same  as  the  semantic  differential  items. 

Likert-type scaling was compared with the use of various types of
check lists by Kassarjian and Nakanishi (1967), and no differences in
results were found. In the Department of the Army study (1952) referred to
previously, the eight-step graphic rating scale was also found to have
higher validity (.53) than when a controlled check list was used (.44).
The four five-step scales had validities ranging from .39 to .44.


Multiple Choice Items


As  used  here,  multiple  choice  items  include  true-false  and  yes-no, 
and  similar  dichotomous  items  as  special  cases.  Generally,  studies  having 
to  do  with  right/wrong  responses  were  excluded  from  the  literature  review, 
unless  they  appeared  to  have  direct  relevance  to  the  use  of  multiple  choice 
items  in  questionnaires.  Comparisons  of  rating  scale  and  multiple  choice 
items  were  reported  above. 

Some  issues  related  to  the  use  of  multiple  choice  items.  A number 
of  issues  appeared  in  the  literature  related  to  the  use  of  multiple  choice 
items.  Those  not  more  appropriately  discussed  in  other  chapters  are 
reviewed  here. 

Swordes (1952) discovered that in a test using items with both four
and five choices, a number of those taking the test marked the fifth space
on the answer sheet when the question had only four possible responses.
It was concluded that certain precautions should be taken to reduce the
undesirable results of using a different number of distractors in the same
examination. These include special instructions, reduction of the number
of alternate groups, and restriction of a varying number of choices to the
more capable test takers, when practicable.

Mosier and Price (1945) presented a means to overcome the usual problem
in multiple choice item construction: the arrangement of the response
alternatives. They used a table in which the 120 permutations of the
numbers one through five had been randomized. They pointed out, however,
that such a table should not be used when the response alternatives form
a logical pattern.
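
A table of this kind is easy to generate today. The sketch below builds all
120 permutations of the numbers one through five and places them in random
order; each row can then be used to arrange the alternatives of one item. The
function names, the fixed seed, and the sample alternatives are hypothetical,
and this is only an approximation of the printed table the authors described.

```python
import random
from itertools import permutations


def randomized_permutation_table(n_choices=5, seed=0):
    """All orderings of 1..n_choices, shuffled into random order.
    The seed is fixed only so the table can be reproduced."""
    table = list(permutations(range(1, n_choices + 1)))
    random.Random(seed).shuffle(table)
    return table


def arrange(alternatives, order):
    """Reorder the response alternatives according to one table row."""
    return [alternatives[position - 1] for position in order]


table = randomized_permutation_table()
alternatives = ["never", "rarely", "sometimes", "often", "always"]
arranged = arrange(alternatives, table[0])
print(len(table), arranged)
```

The sample alternatives above form a logical pattern (a frequency scale), so
they are exactly the kind of item for which the authors advised against using
the table; they are shown only to make the reordering visible.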

Cronbach (1941a) compared multiple choice and multiple true-false
tests, with instructions to guess. He found few significant
differences between them. However, the multiple choice type of test had
slightly  higher  reliability  and  seemed  slightly  easier  to  score.  Hence, 
evidence  from  the  study  supported  the  use  of  the  multiple  choice  rather 
than  multiple  true-false  form,  if  omissions  are  not  expected. 

Data  presented  by  Knowles  (1963)  demonstrated  that  questionnaires  of 
the  true-false  type  can  be  differentially  prone  to  acquiescence  response 
set.  This  topic  is  discussed  in  detail  in  Chapter  X. 

It  was  noted  by  Tuckman  and  Lorge  (1953)  that  graduate  students 
experienced  frustration  because  of  the  either/or  choice  when  circling  a 
yes  or  no  response  when  asked  whether  they  generally  agreed  or  disagreed 
with  statements  about  older  people.  Hence,  the  authors  conducted  a study 
where the same questions were used but where the response was the
percentage of older people for whom the question would apply. No significant
differences  were  found  between  the  two  methods,  causing  the  authors  to 
conclude  that  the  yes-no  method  was  preferred  due  to  its  scoring  ease. 


Comparison of multiple choice and forced choice or paired comparison
items. Appel (1959) administered a 72 item true-false questionnaire and
a parallel forced choice questionnaire consisting of 24 triads. The
content of the items on each form was identical. Based upon the forecasted
validities for the best keys of the two forms for an infinite number of
items, Appel concluded that, for longer forms, the forced choice method is
likely to result in greater validity, while for shorter forms the
true-false method is likely to prove superior.

Osburn,  Lubin,  Loeffler,  and  Tye  (1954)  compared  the  relative  validity 
of  forced  choice  and  single  stimulus  yes-no  self-description  items.  The 
contents  of  the  items  that  were  compared  were  identical.  No  significant 
differences  were  found,  but  the  results  seemed  to  suggest  that  the  choice 
of  format  would  depend  upon  the  number  of  items  available  and  their 
statistical  characteristics. 

A forced choice form of an interest inventory was compared with a
like-indifferent-dislike form by Perry (1955), using the same items for
groups of Navy yeomen and college students. Unit weight and multiple
weight keys were developed for each inventory to differentiate yeomen from
students. The forced choice keys were superior in separating groups in
seven of ten comparisons. However, there was little difference in validity
shrinkage for the two kinds of items.

Comparison  of  multiple  choice  items  and  card  sorts.  Van  Der  Veen, 
Howard,  and  Austria  (1970)  compared  response  formats  of  Q-sort,  multiple 
choice,  and  true-false  methods  according  to  test-retest  reliability  and 
scoring  characteristics.  The  analyses  suggested  that  all  three  forms  were 
reliable  in  test-retest  situations.  Both  the  multiple  choice  and  Q-sort 
methods  showed  high  stability.  However,  the  former  showed  some  variance 
for  social  desirability.  The  true-false  method  was  found  psychometrically 
inferior, showing lower stability and some social desirability variance.
The authors concluded that the Q-sort is the format of choice if testing
time is available; otherwise the multiple choice format should be used.

Comparison  of  multiple  choice  and  open-ended  items.  Two  articles 
compared  multiple  choice  and  open-ended  items.  In  the  first,  Rugg  and 
Cantril  (1942)  examined  the  form  of  the  question  in  public  opinion  polls 
by  using  multiple  choice,  dichotomous  choice,  and  free  response  formats. 
Through  five  different  polls  they  reached  the  conclusion  that  in  all  cases 
no one method was best. Multiple choice gave accurate placement, while
dichotomous choice was simple to score. In addition, free response gave
respondents the most freedom of expression.

In  the  second  study,  Gustav  (1964)  compared  responses  to  a questionnaire 
concerning  methods  of  study  and  preferences  for  true-false,  multiple  choice, 
and  essay  questions,  with  actual  test  scores  for  102  undergraduates.  True- 
false  items  were  liked  least.  A large  proportion  of  the  group  reported 
they  studied  differently  for  particular  types  of  examinations,  and  slightly 
more  than  half  believed  they  do  equally  well  on  all  types  of  tests  despite 
any  preferences. 


Forced Choice and Paired Comparison Items


This section begins with a review of some issues related to forced
choice  and  paired  comparison  questionnaire  items.  Forced  choice  and 
paired  comparison  items  are  next  compared  with  card  sorts  and  then  with 
check  lists.  They  were  compared  with  ranking  items,  rating  scales,  and 
multiple  choice  items  in  earlier  sections  of  this  chapter. 

Some issues related to forced choice and paired comparison items. In
a paper presented by the Personnel Research Section, PRPB, Adjutant-General's
Office (1946), it was noted that the utility of rating scales for predictive
purposes or administrative action had been limited by the ease with which
a rater could determine accurately where he was placing a person on a scale.
A technique, the forced choice method, which reduces the rater's ability
to control the final result of his ratings, was described. It was noted
that the essence of the forced choice technique is to force the rater to
choose between descriptive phrases which appear of equal value (have the
same preference index) but are different in validity (discrimination index).
The major problem is the grouping of alternatives to achieve these ends.
The preference index is the mean of the scale indicating the degree to
which the phrase applies to the group concerned, while the discrimination
index represents the correlation of the descriptive phrase and an overall
rating.
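
The two indices defined above can be sketched in a few lines. The preference
index is computed as the mean applicability rating of a phrase, and the
discrimination index as the Pearson correlation between the applicability
ratings and an overall rating of the same persons. The function names, the
choice of the Pearson coefficient, and the data are assumptions for
illustration; the 1946 paper may have used a different correlation statistic.

```python
from math import sqrt


def preference_index(applicability):
    """Mean of the ratings of how well the phrase applies to each ratee."""
    return sum(applicability) / len(applicability)


def discrimination_index(applicability, overall):
    """Pearson correlation between phrase ratings and overall ratings."""
    n = len(applicability)
    mx, my = sum(applicability) / n, sum(overall) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(applicability, overall))
    sxx = sum((a - mx) ** 2 for a in applicability)
    syy = sum((b - my) ** 2 for b in overall)
    return sxy / sqrt(sxx * syy)


# Hypothetical data for one phrase: five ratees, rated 1-5 on the phrase
# and 1-10 on overall performance.
phrase_ratings = [4, 3, 5, 2, 1]
overall_ratings = [8, 6, 9, 5, 2]
pref = preference_index(phrase_ratings)
disc = discrimination_index(phrase_ratings, overall_ratings)
print(pref, round(disc, 3))
```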

As  noted  by  Buel  (1963),  in  the  construction  of  a forced  choice  scale 
the  preference  and  discrimination  indices  are  usually  derived  from  responses 
to  items  in  check  list  form.  The  items  are  then  grouped  on  the  basis  of 
similar  preference  index  but  dissimilar  discrimination  index.  It  is  assumed 
that  the  preference  value  of  an  item  does  not  change  when  it  is  transferred 
from  its  position  in  the  check  list  form  to  a position  in  the  forced  choice 
form. In his study, Buel (1963) found that, while the preference index
values of only a few items changed, such shifts generally had the effect
of  reducing  the  discrimination  index  values.  Waters  and  Wherry  (1961b) 
also  investigated  the  stability  of  preference  index  values  from  check 
list  to  forced  choice  administration,  and  found  a high  degree  of  stability. 
Berkshire  and  Highland  (1953)  reported  that  a favorableness  index  fits  into 
the  forced  choice  rationale  better  than  does  the  preference  index.  Bartlett 
(1960) made comparisons between the two, and concluded that, if for practical
reasons  only  one  index  is  used  for  matching,  the  preference  index  appears 
to  be  the  better. 
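
The grouping step described in this passage can be sketched as a simple
greedy matching: items are sorted by preference index, and each item is paired
with the next unused item whose preference index is close to its own but whose
discrimination index is clearly different. The thresholds, the greedy
strategy, the function name, and the item data are all hypothetical; they are
not taken from Buel or the other studies cited, and practical scale
construction usually involves judgment as well as arithmetic.

```python
def build_pairs(items, max_pref_gap=0.2, min_disc_gap=0.3):
    """items maps a phrase to (preference_index, discrimination_index).
    Returns pairs matched on preference but differing in discrimination.
    Phrases with no suitable partner are left out."""
    remaining = sorted(items, key=lambda name: items[name][0])
    pairs = []
    while remaining:
        first = remaining.pop(0)
        for other in remaining:
            pref_gap = abs(items[first][0] - items[other][0])
            disc_gap = abs(items[first][1] - items[other][1])
            if pref_gap <= max_pref_gap and disc_gap >= min_disc_gap:
                pairs.append((first, other))
                remaining.remove(other)
                break  # stop scanning once a partner is found
    return pairs


# Hypothetical indices obtained from a check list administration.
check_list_items = {
    "punctual": (3.1, 0.10),
    "explains clearly": (3.2, 0.55),
    "friendly": (4.0, 0.15),
    "plans lessons": (4.1, 0.60),
    "well dressed": (2.0, 0.05),
}
pairs = build_pairs(check_list_items)
print(pairs)
```

As the studies above indicate, the index values may shift once the items are
moved into the forced choice form, so indices from a check list administration
are best treated as a starting point to be rechecked.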

Two  studies  that  were  reviewed  considered  the  failure  to  adequately 
match  the  forced  choice  items.  Bartlett  (1960)  used  a scale  where  the 
items  within  each  set  were  not  perfectly  matched  on  preference,  discrimination, 
favorableness,  general  factor  loading,  and  magnitude  of  group  factor  loading. 
Multiple  correlations  indicated  that  about  half  of  the  variance  of  rating 
response  for  both  peer  and  self-ratings  could  be  explained  by  failure  to 
match  on  these  five  indices.  Eisenberg  (1965)  found  significantly  lower 
scores  when  a form  was  used  with  items  not  matched  on  preference  index 
and  different  on  discrimination  index,  compared  to  an  identical  form 
developed  along  classical  lines. 


Zavala  (1965),  in  his  review  of  the  development  of  the  forced 
choice  technique,  pointed  out  that  the  reliabilities  and  validities  of  the 
technique  compare  favorably  with  other  methods,  and  that  studies  have 
shown  that  the  method  is  more  resistant  than  other  scales  to  effects  of 
bias.  Earlier,  however,  Travers  (1951)  conducted  a critical  review  of  the 
validity  and  rationale  of  the  forced  choice  technique  and  noted  that,  as 
used  in  officer  efficiency  reports,  the  evidence  did  not  support  claims 
made  for  the  validity  of  the  procedure.  He  also  concluded  that  the  high 
validity  coefficients  secured  must  be  considered  to  be  largely  spurious 
until  they  are  demonstrated  to  be  otherwise.  As  noted  in  an  earlier  section, 
Scott  (1968)  similarly  concluded  that  the  generalization  that  self-report 
forced  choice  personality  inventories  are  more  valid  than  single  stimulus 
forms  of  the  same  tests  was  not  supported  by  critical  consideration  of 
the  relevant  evidence. 

In the study by Berkshire and Highland (1953), six kinds of forced
choice  formats  were  compared  for  rating  Air  Force  technical  instructors 
under  experimental  conditions  and  under  instructions  to  give  as  high  a 
score  as  possible.  The  results  for  the  six  forms  were: 

1. Form A: Two statements per block, both favorable or both
unfavorable, choose the more descriptive or the least descriptive. Had
relatively high reliabilities and validities, was one of the two best liked,
but was markedly unsatisfactory in its failure to resist leniency effects.
Was also uneconomic in that over half of the blocks failed to discriminate
when subjected to item analysis.

2. Form B: Three statements per block, all favorable or unfavorable,
choose the most and least descriptive statements in each block. Low in
validity, lowest in reliability, least liked by the raters, and uneconomic.
Was, however, resistant to skewing under instructions to bias.

3. Form C: Four statements per block, all favorable, choose the
two most descriptive statements. Most bias resistant, yielded consistently
high validities under various conditions, was one of the two best liked,
and had adequate reliability. This method was superior to the other methods
tested.

4. Form D: Four statements per block, all favorable, choose the most
and least descriptive statements. Comparable to Form C in reliability and
validity, but was more susceptible to leniency effects and less well liked.

5. Form E: Four statements per block, two favorable and two unfavorable
in appearance, choose the most and least descriptive statements. An inadequate
method, easily biased, low validity, and not as well liked as Forms A, C, and F.

6. Form F: Five statements per block, two of which were favorable,
one neutral, and two unfavorable in appearance, choose the most and least
descriptive. Too easily biased for use. Was moderately well liked, but
was exceeded in validity by Forms A, C, and D.


Agreeing with Berkshire and Highland, Zavala (1965) also noted that
formats using four favorable items, from which the rater chooses the items
most characteristic of the person rated, proved superior to other formats.
He said that this superiority appeared in validities, reliabilities, and
preferences of raters using the form.

In other studies related to forced choice format, Zuckerman (1952)
found that the like-indifferent-dislike arrangement of self-report interest
inventories was clearly superior to the two choice form. Waters and Wherry
(1961a) reported on the effect of response format on subject resistance to
a forced choice self-rating scale. The subjects were found to be more
favorable toward a response format allowing them to indicate the degree
of applicability of each statement in the forced choice pair, even though
they were still forced to choose one statement as relatively more applicable.
Waters (1966) also found that reaction to a forced choice scale was more
favorable when some method was incorporated whereby the subject was given an
opportunity to indicate the degree of applicability of each item to himself.

The effects of partial pairings were investigated in two studies.
McCormick and Bachus (1952) conducted a study to determine the extent to
which it would be possible, in paired comparison ratings of employees, to
use reduced numbers of pairings and still achieve essentially the same
rating results as would be obtained from a complete pairing of all individuals
within a group. The results showed that ratings obtained from partial
pairings resulted in fairly high correlations with ratings based on complete
pairings. The correlations were reduced rather systematically with reductions
in the number of pairs per individual on which the ratings were based. In
a follow-up article, McCormick and Roberts (1952) reported that the
reliability of ratings obtained with partial pairings also tended to decrease
rather systematically with reductions in the number of pairs per individual on
which the ratings were based. However, for groups of 50 individuals, ratings
based on as few as 16 pairs per individual appeared to be relatively reliable.
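
The saving in rater effort implied by these figures can be illustrated with
a short sketch. A complete pairing of 50 individuals requires 1,225
judgments, whereas a design with 16 pairs per individual requires 400. The
cyclic design below is only one hypothetical way of assigning partial
pairings so that every individual appears equally often; the reviewed
articles are not described as having used it, and the function names are
illustrative assumptions.

```python
from collections import Counter


def complete_pair_count(n):
    """Number of pairs when every individual is compared with every other."""
    return n * (n - 1) // 2


def cyclic_partial_pairs(n, pairs_per_individual):
    """Hypothetical partial design (an assumption, not the authors' method):
    pair individual i with its nearest neighbours on a circle, so that every
    individual appears in the same number of pairs."""
    if pairs_per_individual % 2 or not 0 < pairs_per_individual <= n - 1:
        raise ValueError("need an even number of pairs between 2 and n - 1")
    half = pairs_per_individual // 2
    pairs = set()
    for i in range(n):
        for step in range(1, half + 1):
            j = (i + step) % n
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


pairs = cyclic_partial_pairs(50, 16)
appearances = Counter(person for pair in pairs for person in pair)

print(complete_pair_count(50))     # judgments needed for complete pairing
print(len(pairs))                  # judgments needed with 16 pairs each
print(set(appearances.values()))   # pairs in which each individual appears
```

Under these assumptions the partial design asks for roughly one third of the
judgments required by complete pairing.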

As noted above, Zavala (1965) reported that studies on the forced
choice method showed it to be more resistant than other scales to effects
of bias. As will be discussed in Chapter K, the forced choice method
has been used by a number of investigators in an attempt to control
the tendency of individuals to answer self-report inventories in terms of
response sets rather than giving "true" responses. For example, Jackson
and Minton (1963) concluded that combining items into scales and casting
them into a paired comparison context is the method of choice in constructing
adjective check lists for personality assessment. This conclusion was
based upon the effects of the forced choice format in enhancing content
reliability and eliminating the massive response set to check many or
few items on the check list. Howe (1960), however, working with anxiety
scales, reported that data concerning reliability and skewness did not
give an unequivocal impression that the forced choice format reduces the
tendency to give socially desirable responses. Feldman and Corah (1960)
also reported that social desirability is not minimized by the forced
choice format. Braun (1969) pointed out that there is no effective control
for social desirability of alternatives presented, nor for fake-proof
devices. Lederman (1971) interpreted data as showing that the forced
choice format cannot prevent subjects from presenting a more favorable
image of themselves if they choose to do so, but that the problem is usually
less in the forced choice than in the questionnaire format.

Studies of the results of asking subjects to fake their responses
have been conducted by a number of authors, including Izard and Rosenberg
(1958) and Eisenberg (1965). Izard and Rosenberg used a forced choice
personality test with naval aviation cadets. They found that forced
choice scores under instructions for a "set to fake" did not significantly
differ from regular scores, suggesting that the test is not easily
susceptible to faking. Eisenberg (1965), however, found that instructions to
fake did affect results when a forced choice format developed in the classical
manner was used.

Comparison  of  forced  choice  items  and  card  sorts.  In  a study  by  Turgut 
(1963),  a paired  comparison  format  and  a modified  Q-sort  were  compared  for 
efficiency  in  personality  measurement.  The  experiment  tested  reliabilities 
per  unit  of  testing  time  and  acceptability  to  the  examinees.  The  internal 
consistency  coefficients  of  the  paired  comparison  format  averaged  .77.  The 
average  for  the  Q-sort  was  .73  when  corrected  for  the  average  time  spent. 
Subjects' reactions to the formats were measured by a rating scale, which
showed that 57% liked the paired comparison form and 32% liked the Q-sort.

Comparison  of  forced  choice  items  and  check  lists.  Forced  choice  or 
paired  comparison  items  were  compared  with  check  lists  in  two  studies. 

In  the  U.S.  Department  of  Army  study  (1952)  previously  referred  to,  a 
controlled  check  list  with  24  items  where  the  12  most  descriptive  were  to 
be selected was used in addition to forced choice pairs. The validity
of the forced choice pairs, based on rankings by approximately 20 classmates,
was .41; that of the controlled check list, .44.

Merenda  and  Clarke  (1963)  compared  two  self-rating  adjective  check 
lists.  The  first  was  the  regular  free  response  list,  the  second  a forced 
choice  version  where  the  adjectives  were  arranged  in  tetrad  sets.  Ipsative 
scoring  (discussed  in  Chapter  XI)  was  used.  The  results  suggested  that 
the  forced  choice  method  is  likely  to  be  inappropriate  for  use  with  adjective 
check  lists  in  self-concept  assessment. 
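
The defining property of ipsative scores can be shown with a minimal sketch.
It assumes a simplified scoring rule in which the respondent picks the one
most descriptive adjective in each tetrad and the scale keyed to that
adjective receives one point. The scale labels, the scoring rule, and the
responses are all hypothetical and are not taken from Merenda and Clarke.

```python
from collections import Counter

# Hypothetical scale labels; the source does not name the scales used.
SCALES = ("W", "X", "Y", "Z")


def ipsative_scores(choices):
    """Count how often each scale's adjective was picked as most descriptive.

    `choices` holds one scale label per tetrad (one pick per block).
    """
    unknown = set(choices) - set(SCALES)
    if unknown:
        raise ValueError(f"unknown scale labels: {sorted(unknown)}")
    counts = Counter(choices)
    return {scale: counts[scale] for scale in SCALES}


# Two made-up respondents answering the same six tetrads.
respondent_1 = ipsative_scores(["W", "W", "X", "W", "Z", "W"])
respondent_2 = ipsative_scores(["Y", "X", "Y", "Z", "Y", "X"])

print(respondent_1, sum(respondent_1.values()))
print(respondent_2, sum(respondent_2.values()))
```

Because every block awards exactly one point, each respondent's scale scores
sum to the number of blocks. A high score on one scale therefore forces lower
scores on the others, which is the constraint that limits comparisons between
individuals.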


Card  Sorts 


The advantages of using card sorts for acquiring rating information
on any issue have been discussed by a number of authors, including Dubois
(1949-50).  The  most  extensive  discussion  of  the  use  of  card  sorts  (or,  more 
generally,  Q-technique  and  its  methodology)  probably  appears  in  The  Study 
of  Behavior  by  William  Stephenson  (1953).  Card  sorting  as  a technique  for 
survey interviewing was discussed by Cataldo, Johnson, and Kellstedt (1970),
who assessed its reliability, validity, and response bias and the reactions
of  respondents  and  interviewers.  They  concluded  that  card  sorting  is  a 
fast  and  interesting  method  of  obtaining  valid  and  reliable  interview  data, 
and  one  which  appears  to  be  capable  as  well  of  counteracting  at  least  some 
of  the  biasing  effects  of  response  set. 

Four articles were abstracted that addressed the issue of whether
Q-sorting procedures should allow a free or unforced sort, where the subject
is allowed to place as many cards as desired within the sorting intervals,
or require a forced sort, where a predetermined number of items have to
be placed in each interval cell. Block (1956) compared forced and unforced
Q-sorting procedures using 76 items with 11 sorters. The forced sort seemed
to offer more stability and slightly more discrimination than unforced
sorting. Gaito (1962), considering statistical and non-statistical aspects
of Q-sorting, concluded that severe defects appeared present for various
tests of significance when forced sorting was involved, and moderate
distortion when the free sort was used. Hess and Hink (1959) also compared
the forced and free Q-sort procedures, and found that the two types of
administration gave similar results when the identical Q-sort was used with
adolescents. A similar conclusion was reached by Brown (1971). He noted
that arguments favoring free over forced Q-sorts had assumed that forcing
leads to the loss of important statistical information and interferes with
interval properties, rendering Pearson's r inappropriate for analysis. He
found that Q-sorts with identical item orderings but with varied
distributions provided essentially the same correlations and factor structures
when coefficients were computed using Spearman's r_s, Kendall's tau, and
Pearson's r. Hence, he concluded that the same results are obtained regardless
of distribution and of whether interval or ordinal statistics are used.
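
The kind of comparison described for Brown (1971) can be sketched as follows.
Two hypothetical sorters each order nine items; the same two orderings are
then scored under a peaked (forced, quasi-normal) distribution and under a
flatter one, and the sorters are correlated under each. The orderings, pile
sizes, and function names are illustrative assumptions and do not reproduce
Brown's data or procedure. Kendall's tau is omitted for brevity.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation computed as Pearson's r on average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def pile_scores(ordering, pile_sizes):
    """Score each item by the pile it lands in.

    `ordering` lists item ids from least to most characteristic;
    `pile_sizes` gives how many cards go in each successive pile.
    """
    if sum(pile_sizes) != len(ordering):
        raise ValueError("pile sizes must account for every item")
    scores = {}
    position = 0
    for pile, size in enumerate(pile_sizes):
        for item in ordering[position:position + size]:
            scores[item] = pile
        position += size
    return [scores[item] for item in sorted(scores)]


# Hypothetical orderings for two sorters (item ids 0 through 8).
sorter_a = [0, 1, 2, 3, 4, 5, 6, 7, 8]
sorter_b = [1, 0, 2, 4, 3, 5, 7, 6, 8]

for label, sizes in (("peaked", (1, 2, 3, 2, 1)), ("flat", (2, 2, 1, 2, 2))):
    a = pile_scores(sorter_a, sizes)
    b = pile_scores(sorter_b, sizes)
    print(label, round(pearson_r(a, b), 3), round(spearman_rho(a, b), 3))
```

In this made-up case the coefficients stay high under both distributions,
which is in the direction of Brown's conclusion; a single constructed example
is, of course, no test of it.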

In previous sections of this chapter, card sorts were compared with
rating scales, multiple choice items, and forced choice items.


Semantic  Differential  Items 


This section reviews some of the pros and cons of the use of the
semantic differential, and presents only a few of the many articles on
the technique. The first major paper on the semantic differential was
by Osgood (1952), in which the development of the technique as a general
method of measuring meaning was described. It involved the use of factor
analysis to determine the number and nature of factors entering into
semantic description and judgment, and the selection of a set of specific
scales corresponding to these factors which can be standardized as a measure
of meaning. Using this differential, the meaning of a particular concept
to a particular individual can be specified quantitatively. The classic
book on the semantic differential was written by Osgood, Suci, and Tannenbaum
(1957).

Two studies that were reviewed investigated the reliability of the
semantic differential. Jenkins, Russell, and Suci (1958) had 360 words
rated on 20 scales by 18 groups of 30 subjects. Profiles of mean scale
values for each concept were prepared. The reliability of these scale
values was found to be .97, and mean scale values correlated .97 with
median scale values. Miron (1961) investigated the influence of
instructions upon the test-retest reliabilities of the semantic differential,
and found the correlations ranged from .857 to .996. The basic measure
used in the experiment was the absolute deviation between mean concept
scores on each of five factor scores summarizing a given set of scales.

Two  reports  of  the  validity  of  the  semantic  differential  were  reviewed, 
both  in  the  marketing  area.  Agostini  (1962)  reported  evidence  on  the 
validity  of  the  technique  as  an  indicator  of  brand  attitude  as  measured 
by  purchase  behavior.  Significantly  higher  brand  average  attitude  scores 
were  found  among  users  of  two  brands  of  a food  product  than  among  nonusers, 
thus  illustrating  the  validity  of  the  semantic  differential  for  this  use. 
Barclay  (1964)  also  found  that  the  semantic  differential,  in  the  form  used, 
was  a valid  indicator  of  brand  attitudes  as  inferred  from  purchasing 
behavior.  However,  the  differential  as  used  was  found  not  to  be  a very 
sensitive  measure. 

Proximity  error  in  administering  the  semantic  differential  was 
studied  by  Kane  (1968).  Proximity  error  occurs  when,  due  to  the  ordering 
or  polarity  of  the  semantic  differential  scales,  one  answer  results  in 
another  answer  to  a subsequent  question  being  substantially  changed  from 
what  it  would  otherwise  be.  He  investigated  effects  due  to  the  order  of 
concept  presentation,  of  adjective  presentation,  and  of  order  of  adjectives 
within  a particular  scale.  He  found  no  significant  differences  in  response 
traceable  to  questionnaire  format  manipulations,  showing  that  proximity 
error  was  not  a problem  with  semantic  differential  questionnaires. 

Worthy (1969) noted that semantic differential rating scores are often
reported as an extreme response measure which ignores the middle or neutral
categories as a response. He reanalyzed data and concluded that those who
tended to make extreme responses also tended to make midpoint responses.
The implication for scoring was not to make the assumption that a midpoint
response is totally lacking in extremeness since it is a demonstrative
response. A related concern has been whether or not the semantic
differential measures both the intensity and direction of attitude. Mehling
(1959) plotted subjects' ratings on an intensity scale against responses to
related semantic differential scales, and concluded that as used in the study
the semantic differential did measure both the direction and intensity of
attitude. Bentler (1969) also found that semantic space is approximately
bipolar, while Carter, Ruggels, and Chaffee (1968) found that subjects can
more accurately denote their descriptions to objects when one end of the
scale is left for them to describe.

Semantic differential scales were compared with check lists by Block
(1958) and Hughes (1967). Block found a correlation of .94, after correction
for attenuation, between semantic differential descriptions and adjective
check list descriptions of the ideal self and the like-sex parent. He
concluded that the semantic differential may be a rather complicated way
of developing a measure that is more readily and reliably secured by other
means. Hughes (1967) reported that 28% of the semantic differential
scales he used detected attitude change, while only 11% of the check list
scales did. Both showed the same test-retest reliability, however (.58).
Preference for the semantic differential increased from 11% to 34% as the
respondents became familiar with it, while preference for the check list
declined from 57% to 40%.
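
The correction for attenuation mentioned in connection with Block's result
is the standard adjustment that divides an observed correlation by the square
root of the product of the two measures' reliabilities. A minimal sketch
follows; the input values are hypothetical and are not Block's figures, which
are not given in the material reviewed.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures if both were
    perfectly reliable.

    r_xy: observed correlation between measures x and y
    r_xx, r_yy: reliability coefficients of x and y (each in (0, 1])
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values chosen only to show the size of the adjustment.
observed = 0.60
corrected = correct_for_attenuation(observed, r_xx=0.80, r_yy=0.50)
print(round(corrected, 3))
```

With modest reliabilities the corrected coefficient can be considerably
larger than the observed one, so a corrected value such as .94 does not by
itself indicate how closely the observed descriptions agreed.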

Comparisons  of  the  semantic  differential  with  other  types  of  rating 
scales  appear  in  an  earlier  section  of  this  chapter. 


Other  Types  of  Items 

The  types  of  items  to  be  considered  in  this  section  are  projective 
items,  open-ended  items,  check  lists,  rearrangement  items,  and  matching 
items.  Comparisons  of  check  list  items  and  rating  scale,  forced  choice, 
and  semantic  differential  items  were  discussed  in  previous  sections  of 
this  chapter. 

Projective  items.  The  use  of  projective  items  was  not  a high  priority 
topic  for  this  report,  but  three  reviewed  documents  discussed  them.  In  the 
study  of  attitudes  of  troops  in  the  tropics  authored  by  Hart,  Faust,  Rowland, 
and  Lucier  (1964),  complementary  objective  and  projective  techniques  were 
compared  and  contrasted  for  their  efficacy  in  assessing  attitudes  towards 
items of QM issue and situations relating to tropical military service.
They found that projective and unstructured data collection techniques
provided attitudinal data not captured by the more structured techniques.
They also found that responses to objective items correlated significantly
with sentence completion items on the same topic. Thematic stimuli provided
in a projective pictures-written response technique were, however,
inadequate for eliciting the appropriate topic related attitudinal responses.
A color response technique did not indicate any relationships with other
techniques. Nevertheless, the authors recommended that a combination of
highly structured, semistructured, and unstructured techniques be employed
in a complex measurement setting, as is typical in the case of attitude
measurement.

In the marketing area, Haire (1950) found that projective methods
may aid in determining respondents' motivations toward a stimulus in linking
attitudes  and  behavior.  Steele  (1964)  investigated  the  validity  of  pro- 
jective questions,  and  concluded  that  the  projective  technique  is  a useful 
device  where  inhibitions  may  be  raised  in  an  interview. 

Open-ended  items.  A comparison  of  open-  and  close-ended  questionnaire 
items  was  presented  in  Chapter  II,  while  the  use  of  open-ended  items  to 
determine  questionnaire  content  was  discussed  in  the  first  section  of  this 
chapter.

On  other  relevant  topics,  Roslow,  Wulfeck,  and  Corby  (1940)  noted 
that  results  from  free  response  questions  may  be  misleading  when  the 
memory  of  the  respondent  and/or  familiarity  with  possible  responses 
operates to any appreciable extent. Payne (1965) cited meaningful roles
for open-ended questions in preliminary phases of research (for example, in
developing categorical, checkbox questions), in eliminating the need for
asking reason-why questions of every respondent, and in providing quotes
which may add interest to a report. Frisbie and Sudman (1968)
found a direct relation between the amount of speech in open-ended questions
and positive and negative feelings. People with high positive or negative
feelings talked more than those with low positive or negative feelings.
In both cases, those classified as having high feelings had one more sentence
than those classified as low.

Comparison  of  open-ended  items  and  check  lists.  Two  studies  compared 
the use of open-ended questions with check lists. Scates and Yeomans (1950b)
studied  the  effect  of  question  form  on  the  course  requests  that  were  received 
from  adults  employed  in  scientific  and  engineering  fields.  They  found  that 
questionnaires  involving  depth  essay  questions  were  returned  by  a smaller 
proportion  of  persons,  but  the  requests  which  they  contained  were  believed 
to  be  more  firmly  based.  A course  check  list  elicited  a larger  number  of 
course  requests  per  employee  who  returned  it  than  did  questions  which  asked 
the  employee  to  think  of  the  courses  he  may  desire. 

The check list and open response methods of survey research were
compared by Belson and Duncan (1962) with respect to yesterday's reading
of newspapers and magazines and with respect to yesterday's TV viewing.
Results indicated that offering items in the form of a check list produced
an appreciably higher rate of claim that publications were looked at.
However, the check list was found to depress the enumeration of items placed
under its "other" category. The open response system produced only 73% of
the volume of endorsements produced by the check list, but it yielded 1.72
times as many mentions of items that the check list relegated to its "other"
category.

Other  topics  regarding  the  use  of  check  lists.  Roslow,  Wulfeck,  and 
Corby  (1940)  found  that  the  proportions  obtained  by  alternatives  in  check 
list  questions  tended  to  be  influenced  by  the  number  and  completeness  of 
the  alternatives  presented.  And  Lindzey  and  Guest  (1951)  found  that  omissions 
of  popular  items  from  check  lists  produced  substantial  changes  in  response 
distribution.  They  also  found  that  few  respondents  used  the  "other-write  in" 
category. 

McCormick  (1960)  studied  the  effect  that  the  number  of  questions  asked 
about  each  task  had  on  the  consistency  and  amount  of  information  provided 
by  Air  Force  personnel  when  completing  task  inventories.  No  systematic 
differences  were  found  in  the  number  of  tasks  reported  by  incumbents  who 
were  asked  to  report  one,  two,  three,  or  four  types  of  information  about 
such  tasks.  The  requirement  to  report  more  types  of  information  generally 
provided  more  reliable  information. 

Rearrangement items. Sims (1934) examined the use of rearrangement
tests of ability as an alternative to objective tests. He found that as
the length of the rearrangement set increased from five to 15 items, the
reliability also increased. At some point before 30 items, however, the
length of the test seemed merely to reflect the student's intelligence.
He concluded that this type of test compares favorably with other types of
objective tests in reliability, time for taking, and scoring time, when
the desire is to measure the ability to relate items to some designated
basis.

Articles more pertinent to the use of rearrangement items in
questionnaires were not located during the literature search which preceded
this review.

Matching  items.  The  literature  review  uncovered  only  one  article  on 
the topic of item matching. Follman, Urbanke, and Burley (1971) compared
three  item  matching  objective  test  formats  using  60  undergraduate  students 
who  were  asked  to  match  definitions  to  20  appropriate  verbs  typically  used 
in  essay  type  questions.  The  results  indicated  that  better  scores  were 
obtained  by  ordering  the  items  randomly  but  dividing  them  into  small  groups 
of  three  to  six  items. 

Conclusions  Regarding  the  Pros  and  Cons  of  Various  Types  of  Questionnaire 
Items 

Ranking  items.  Based  upon  five  studies,  it  would  appear  that  ranking 
and  rating  techniques  are  generally  comparable.  There  is  some  evidence, 
however,  that  conclusions  based  upon  a single  judge  differ  from  those  based 
upon  multiple  judges.  More  research  appears  to  be  needed  in  this  area, 
especially  studies  designed  so  that  the  items  to  be  ranked  or  rated  are 
as  comparable  as  possible. 

Contradictory  evidence  was  obtained  regarding  the  comparison  of  ranking 
and  paired  comparisons.  The  bulk  of  the  evidence,  however,  seems  to  support 
the  notion  that  the  two  techniques  produce  comparable  results.  In  two  studies 
a linear  relationship  was  found  between  the  results  obtained  from  the  two. 
Several  investigators  noted  that,  if  the  results  are  comparable,  ranking  is 
to  be  preferred  to  paired  comparisons  since  it  takes  less  time. 

Rating  scale  items.  A majority  of  the  studies  reviewed  found  that 
results  obtained  from  the  use  of  rating  scales  were  comparable  to  those  when 
forced  choice  or  paired  comparison  items  were  employed.  This  is  surprising 
in  terms  of  the  known  properties  and  limitations  of  the  ipsative  scores 
produced by forced choice devices, as discussed in Chapter XI. However,
Scott (1968) noted that generalizations about forced choice and single
stimulus tests are open to question since most studies did not use identical
items in the two forms compared. If results from the two are comparable,
then  rating  scales  might  be  preferred  since  they  are  more  efficient  and  take 
less  testing  time. 

Results comparing rating scales and card sorts are inconclusive. Only
two studies were found in this area. Results are similarly inconclusive
in the comparison of rating scales and semantic differential items and
of rating scales and check lists, as few studies were located.

Multiple choice items. Comparisons between multiple choice items and
true-false  items  (which  are  a special  type  of  multiple  choice  item)  are 
discussed  in  Chapter  VI.  The  three  studies  that  reported  comparisons  of 
multiple  choice  and  forced  choice  items  did  not  come  to  the  same  conclusion, 
although  there  was  some  tendency  to  feel  that  a choice  of  format  would  depend 
upon the number of items available and their statistical characteristics.
In the one study located that compared multiple choice items and card sorts
(Van Der Veen, Howard, & Austria, 1970), it was concluded that the true-false
method was psychometrically inferior, and the Q-sort should be used in
preference to the multiple choice format if adequate testing time is
available. There is also little on which to base a conclusion regarding
multiple choice and open-ended items. Each might have its place, depending
upon the purposes and objectives of any given study.

Forced  choice  and  paired  comparison  items.  Some  issues  related  to  forced 
choice  and  paired  comparison  items  were  reviewed.  Based  upon  two  studies, 
problems  seem  to  arise  when  the  alternatives  within  a forced  choice  item  are 
not  adequately  matched.  Although  one  investigator  (Zavala,  1965)  pointed 
out  that  the  forced  choice  technique  compared  favorably  with  other  methods 
in  terms  of  reliability  and  validity,  two  others  (Travers,  1951;  Scott,  1968) 
felt  that  some  of  the  claims  made  were  not  supported  by  the  evidence.  A 
more  critical  and  detailed  review  of  the  studies  conducted  in  this  area  is 
probably  in  order. 

Two  investigations  (Berkshire  & Highland,  1953;  Zavala,  1965)  lead 
to  the  conclusion  that  the  best  format  for  forced  choice  items  (at  least 
when  used  for  personnel  ratings)  is  four  statements  per  block,  all  favorable, 
where the two most descriptive statements are to be chosen. Zuckerman
(1952), however, found three statements preferable to two. In terms of
what  was  noted  regarding  the  need  to  adequately  match  items,  more  research 
on  this  issue  would  probably  be  worthwhile. 

There  still  seems  to  be  some  question  as  to  the  extent  that  forced 
choice  items  can  be  used  to  reduce  undesirable  response  sets,  at  least 
in  terms  of  the  articles  included  in  this  review.  Since  this  was  not  a 
topic  of  great  stress  during  the  literature  review  because  many  of  the 
articles  are  in  the  personality  area,  a more  intensive  literature  review 
on  the  use  of  forced  choice  items  to  control  response  set  might  be  in  order. 

Only  one  study  was  found  comparing  forced  choice  items  and  card  sorts, 
while  only  two  were  located  comparing  forced  choice  items  and  check  lists. 
Conclusions  about  them  would  not  appear  warranted. 

Card sorts. A majority of the articles about card sorts addressed
the question of whether forced or unforced sorts should be used.
The conclusion appears to be that when the same items are used, it does
not make much difference which system is employed.


Semantic differential items. There are a number of investigators
who advocate the use of the semantic differential, and its reliability and
validity  seem  to  have  been  established  in  the  articles  reviewed.  Block 
(1958),  however,  questioned  whether  the  semantic  differential  may  be  a 
rather  complicated  way  of  developing  a measure  that  is  more  readily  and 
reliably  secured  by  other  means.  Since  the  technique  was  initially 
developed  as  a general  method  of  measuring  meaning,  a more  extensive 
literature  review  regarding  its  use  in  questionnaires  might  be  in  order. 

Other types of items. Projective items, open-ended items, check lists,
rearrangement items, and matching items were also discussed above. Since
there were few studies about these types of items, conclusions do not
appear warranted.

Chapter IV


COMPARISON  OF  SCALING  TECHNIQUES 


Once  a selection  has  been  made  of  the  kinds  of  items  that  are  to  be 
used  in  a questionnaire,  it  may  be  necessary  to  determine  scale  values  for 
the  items.  Chapter  IV  is  addressed,  therefore,  to  a comparison  of  scaling 
techniques.  The  literature  review  made  in  conjunction  with  preparing 
this report, however, did not stress articles on scaling techniques,
although some were uncovered. Since the topic was not stressed and many
other articles could be located, a discussion similar to those included
in other chapters does not appear warranted. Instead, the articles for
which abstracts were available from the literature search are listed
below with a short annotation regarding their content. Comparisons of
psychological  scaling  techniques  with  other  types  of  questionnaire  items 
are  discussed  in  Chapter  III  in  those  sections  pertaining  to  rating  scales. 
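
As a concrete illustration of what determining scale values involves, the
sketch below applies Thurstone's Case V solution to paired comparison data,
one of the techniques named in the annotations that follow. It is only a
minimal sketch under stated assumptions: the proportion matrix is
hypothetical, and the function name case_v_scale_values and the clipping of
extreme proportions are conveniences introduced here, not procedures taken
from any of the studies cited.

```python
from statistics import NormalDist, mean


def case_v_scale_values(prop, eps=0.01):
    """Thurstone Case V scale values from paired-comparison proportions.

    prop[i][j] is the proportion of judges who rated stimulus j above
    stimulus i, so prop[i][j] + prop[j][i] == 1 for i != j.
    Proportions are clipped to [eps, 1 - eps] so the normal deviate stays
    finite; this clipping rule is an assumption made for the sketch.
    """
    n = len(prop)
    normal = NormalDist()
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                p = min(max(prop[i][j], eps), 1 - eps)
                z[i][j] = normal.inv_cdf(p)
    # Column means of the z matrix give the least-squares scale values.
    raw = [mean(z[i][j] for i in range(n)) for j in range(n)]
    # The origin is arbitrary, so the lowest stimulus is set to zero.
    lowest = min(raw)
    return [round(value - lowest, 3) for value in raw]


# Hypothetical proportions for three opinion statements A, B, and C.
proportions = [
    [0.50, 0.70, 0.90],
    [0.30, 0.50, 0.75],
    [0.10, 0.25, 0.50],
]
scale_values = case_v_scale_values(proportions)
print(scale_values)
```

Because the resulting values form an interval scale with an arbitrary
origin, only the distances between statements carry meaning.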

Ballin and Farnsworth (1941) developed a graphic rating method that had
scale values which agreed closely with scale values obtained using the
Seashore-Hevner method and the method of equal appearing intervals.

Banta  (1961)  used  the  methods  of  Likert,  Thurstone,  and  a newly  developed 
method  of  Unfold  Partial  Rank  Order  to  measure  attitudes  towards  each  of 
three  referents  differing  in  ambiguity.  The  scores  obtained  from  all  three 
methods  correlated  equally  well  at  each  level  of  referent  ambiguity. 

Barclay  and  Weaver  (1962)  found  that  the  construction  of  a Thurstone  scale 
took  43.2%  more  time  than  the  construction  of  a Likert  scale  with  the 
same  number  of  items,  and  that  the  Likert  scale  was  more  reliable. 

Bartlett, Heermann and Rettig (1960) compared the magnetic board rating
technique to the paired comparison, ranking, Likert, graphic rating and
equal appearing intervals methods. It was concluded that all six scaling
techniques were equally accurate measures of scale value.

Clark and Kriedt (1948) applied Guttman's scaling techniques to the
Rundquist-Sletto attitude scale. The scale did not meet Guttman's
criterion for adequate scale unidimensionality despite the fact that the
internal consistency of the scale was high. Thus the authors concluded that
Guttman's method of scale analysis may have serious limitations in the
area of general attitude measurement.

Coombs  (1950)  developed  the  ordered  metric  scale  which  is  based  on  the 
order  of  magnitude  of  the  interval  between  objects. 

Edwards (1946b) concluded that Likert scales tended to have higher
reliability than Thurstone scales and were easier to construct.

Edwards (1948) refined Guttman's technique for determining cutting points
by assuming perfect reproducibility and making predictions of item
responses on this assumption.

Edwards (1951) discussed the use of the method of successive intervals,
which is a psychological scaling procedure in which stimuli are classified
into successive intervals according to the degree of some defined
attribute which they are judged to possess.

Edwards (1956) concluded that using the method of paired comparisons in
conjunction with a set of opinion statements with known scale values had
promise for the construction of attitude scales with a relatively high
degree of reproducibility and satisfactory reliability.

Edwards and Kenny (1946) established the fact that it is possible to
construct scales by the Likert and Thurstone methods which will yield
comparable scores.

Edwards and Kilpatrick (1948b) described the Scale-Discrimination method
which makes use of Thurstone scaling procedures, retains Likert's process
for evaluating the discriminatory power of the individual items, and meets
the requirements of Guttman's Scale Analysis.

Eysenck and Crown (1949) handled the results of a study by: determining
reliabilities under various systems of scoring (Thurstone, Likert and Scale
Product); factorial analysis; Guttman Scalogram analysis; plotting of
scale positions of items against number of endorsements, percentage
reproducibility, and factor saturations; and determining neutral point by
different methods.

Farnsworth  (1945a)  found  that  scale  weights  obtained  for  items  using  a 
technique  modified  from  Allport  approximated  weights  obtained  by  Thurstone 
with  sorting.  The  modified  technique  was  a method  where  extreme  items 
were  put  at  the  opposite  ends  of  a series  of  equilength  lines  representing 
the  individual  items  and  where  the  subjects,  in  a group  situation,  checked 
relative  item  value. 

Farnsworth (1945b), as the result of a study in which judges of statements
were asked their understanding of the distance between degrees of the
scale, questioned the use of equal appearing interval scales.

Federico (1971a) studied Likert and Guttman-type questionnaire forms. He
found that Air Force students demonstrated significantly more favorable
attitudes toward analogous content areas on the Guttman-structured items
than on the Likert-structured items. Evidently, item formatting did affect
the degree of the evaluative assertions ascribed to the attitude universe.

Ferguson (1939b) suggested the following requirements for attitudinal
scales: scale results correspond to underlying physical order; scale
values selected not affected by other items in scale; attitudes of judges
of responses do not affect scale values; specific in content; validity;
reliability; and scale on a linear continuum.

Ford (1950) illustrated a rapid method for determining whether a set of
six (or fewer) attitude questions form a scale.

Gardner  (1950)  suggested  a technique  for  obtaining  an  interval  scale  not 
dependent  on  an  assumed  normal  distribution  or  the  selection  of  one  given 
population.  The  units  of  this  scale  are  called  K-units. 

Guilford  (1928)  presented  a method  for  getting  scale  values  which  are 
assumed  to  be  on  an  "objective"  continuum.  These  scale  values  are  in 
terms  of  sigma  units  from  an  assumed  mean  of  all  the  stimulus  values. 

Guttman (1947a) described the Cornell technique, which is primarily the
combining of data to produce cutting points that minimize error of
reproducibility.

Gulliksen  & Messick  (19b9)  included  in  their  book  discussions  on:  the  method 

of  successive  intervals;  quantitative  judgment  scales;  similarity  of  stimuli; 
metric  properties  of  behavioral  data;  the  method  of  successive  categories; 
ratio scales; partition scales; confusion scales; and multidimensional
unfolding.

Hughes (1967) compared the Thurstone scale, a modification of the semantic
differential, and the check list scales for their ability to detect changes
in attitudes, their test-retest reliability, and their acceptance among
respondents.

Jahn (1951) extended scale analysis along three lines: one, to include
alternative methods for reduction of a set of attributes to a single
quantitatively defined variable; two, to include methods for the reduction
of a set of attributes to a single qualitatively defined variable or
qualitative types; and three, the development of statistical-experimental
tests to decide whether the theorems of scale analysis are to be accepted
for application to a given empirically defined set of attributes.

Kelley,  Hovland,  Schwartz  and  Abelson  (1955)  found  that  data  analysis  using 
the  method  of  equal  appearing  intervals  did  not  discriminate  judges  with 
extreme  views. 

Komorita (1963) demonstrated that a neutral region could be determined for
Likert scores but that, because of the quasi-scale characteristic of the
instrument, no neutral point could be clearly delineated. Weighting content
scores by intensity as in the Likert method, instead of using simple
zero-one weights, had negligible effects on total score. However, if the
number of items is small, there seemed to be some advantage in the Likert
method.

Kriedt  and  Clark  (1949)  concluded  that  the  Cornell  Technique  of  Scale 
Analysis  (Guttman)  can  prove  to  be  very  useful  in  problems  of  psychological 
measurement  providing  discretion  is  exercised  in  the  selection  of  suitable 
problems and the handling methods.

Kundu (1962) modified the scale points on a Likert scale. This
modification was a factor dividing rating method whereby the neutral point
is eliminated on theoretical grounds and the remaining scale points are not
fixed in advance by the test author, but are assigned weights by the
respondent in accordance with his prevailing response bias.

Likert (1932) presented the background and theory for his measurement
approaches.

Prothro (1955) found data that supported Thurstone's assumption that the
sorting of items into an attitude scale is independent of the attitude of
judges.

Rozeboom & Jones (1956) stated that the degree to which scale values
computed by the method of successive intervals diverge from theoretically
"true" values is seen to be due to three types of error: error due to
inequalities in variances of the distribution from which the scale values
are computed; error due to nonnormality of the distribution; and sampling
error.

Saffir (1937) made a comparison between scales constructed by the method
of paired comparison, rank order, and the method of successive intervals.
He found mutually linear scales, and concluded that all the methods he
employed produced equally valid scales. Since the three different methods
of gathering data (method of paired comparisons, order of merit method,
and method of successive intervals) and the two different psychophysical
techniques for scaling raw data (the law of comparative judgment and the
method of successive intervals) produced comparable scales, any one can be
used with considerable confidence.

Schaie  (1963)  hypothesized  that  the  concurrent  validity  of  questionnaires 
could  be  increased  by  the  use  of  item  weights  obtained  by  expert  scaling, 
instead  of  by  using  conventional  unit  weights.  The  results  showed  only 
low  magnitude  increments  in  validity. 

Seashore and Hevner (1933) modified Thurstone's method of equal appearing
intervals by having judges rate items on a nine point scale which was
printed on the left hand margin of each item, instead of sorting items
printed on separate slips into nine piles.

Siegel and Schultz (1962) demonstrated that a job related technical skills
check list could be scaled by both Thurstone and Guttman techniques.

Siegel, Schultz, and Benson (1960) hypothesized that skills are scalable
in the same manner (Guttman and Thurstone equal appearing interval scales)
as attitudes and sensory phenomena. Although their results supported the
hypothesis, discrepant data raised some question as to the generality of
the hypothesis.

Siegel  and  Siegel  (1962)  found  that  medians  graphically  derived  and  medians 
from  sorted  judgments  scaled  by  the  method  of  equal  appearing  intervals 
correlated .97.

Sjoberg (1965) compared conventional scaling techniques with his
correlational scaling method using paired comparisons data. Nonlinear
relations between scales were found.

Stangenberg  (1966)  presented  definitions  of  various  scales  (the  nominal, 
ordinal,  interval,  ratio  and  logarithm)  and  discussed  them  in  terms  of 
measurement  theory. 

Stouffer, Guttman, Suchman, Lazarsfeld, Star, and Clausen (1949) defined,
in their book, the components of a scale and discussed the limitations of
the use of scales. Their work was a result of studies carried out with
military subjects during World War II.

Taylor and Parker (1964) found that the graphic rating scales proved as
reliable as Guttman scales, and an examination of the interscale
correlations showed that similar conclusions would be drawn from either
technique.

Thurstone (1959) presented the background and theory for his measurement
approaches.

Torgerson  (1958)  presented  in  his  book  definitions  and  explanations  of 
scaling  methods.  The  book  includes  extensive  numerical  examples. 

Witryol (1954) experimentally compared Thurstone's Case III and Case V
and Guilford's shortcut approaches to scaling paired comparison data.
The intercorrelations between the scale values obtained by the three
methods were approximately unity.

York (1966) found that Thurstone's scale values are stable over 35 years.

Zinnes (1969) reviewed the literature on scaling. The theme of the review
was that scaling theory should be a theory of choice.
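
Several of the annotations above turn on Guttman's reproducibility
criterion (e.g., Clark and Kriedt, 1948; Edwards, 1948; Guttman, 1947a).
The sketch below shows one common way of computing a coefficient of
reproducibility for dichotomous items, counting as an error every response
that departs from the ideal pattern implied by a respondent's total score.
That error-counting rule, the function name, and the response data are all
assumptions made for illustration; the Cornell technique works from cutting
points and may count errors differently.

```python
def coefficient_of_reproducibility(responses):
    """Return 1 - errors / (respondents * items) for 0/1 response rows.

    Items are ordered from most to least frequently endorsed. A respondent
    with total score s is expected to endorse exactly the s most popular
    items; each departure from that ideal pattern counts as one error.
    """
    n_items = len(responses[0])
    endorsements = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -endorsements[j])
    errors = 0
    for row in responses:
        score = sum(row)
        for rank, j in enumerate(order):
            ideal = 1 if rank < score else 0
            if row[j] != ideal:
                errors += 1
    return 1 - errors / (len(responses) * n_items)


# Hypothetical answers of five respondents to four agree/disagree items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],  # departs from the ideal pattern for a score of 2
]
rep = coefficient_of_reproducibility(answers)
print(round(rep, 2))
```

Only the last respondent deviates from a perfect scale pattern, which
accounts for two of the twenty responses.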


Chapter  V 


EFFECTS  OF  VARIATION  IN  PRESENTATION  OF 
QUESTIONNAIRE  ITEMS 


Once a decision has been made regarding the type or types of items
that are to be used in a questionnaire, based on the pros and cons
discussed in Chapter III, attention must be given to the actual development
of the items. In this chapter consideration is given to articles in the
literature that investigated the effects of variations in the presentation
of questionnaire items. Sections are included on the mode of items;
wording of items; clarity of items; difficulty of items; length of
question stem; order of question stems; and order of response alternatives.


Mode  of  Items 

A series of research studies was uncovered concerning verbal versus
pictorial presentation of items or stimuli for subjects' responses. The
studies covered a variety of topical areas and types of subjects. Table
V-1 summarizes the literature review conducted in this area.

Four studies found no significant differences in subjects' responses
to verbal and pictorial formats (Blake, 1969; Greenberg, 1959; Jensen,
1930; and Rohila, Shanhdhar & Sharma, 1966). Only one study, relating to
a consumer preferences examination, showed statistically significant
differences attributed to mode of item presentation (Weitz, 1950). It should
be noted that this study did not establish superiority of one format over
the other, but merely noted differences on brand ratings. Another study
on the influence of communications (Luchins and Luchins, 1955b) provided an
important screening procedure for the use of pictures in questionnaire
items. This study suggested that conformity with false communications and
failure to respond were higher for ambiguous than clear-cut pictures.
Obviously, if pictures are to be used they should be pretested for clarity
of their presentation of the concept or object to be evaluated.

The overall evaluation of this area of the literature is that pictures
can be effectively employed in questionnaires. Their use may facilitate
obtaining survey responses from subjects with limited verbal comprehension
who might have difficulty responding to questions employing lengthy
definitions of concepts or objects.

[Table V-1. Summary of Studies on Mode of Items]


Wording of Items


The wording of question stems and response alternatives is a critical
consideration in obtaining valid, reliable, and objective survey data.
For example, Payne (1951) cited the following illustration. The three
questions below were administered to three separately matched samples
of respondents:

1. Do you think anything should be done to make it easier for people
to pay doctor or hospital bills? (82% replied "yes")

2. Do you think anything could be done to make it easier for people
to pay doctor or hospital bills? (77% replied "yes")

3. Do you think anything might be done to make it easier for people
to pay doctor or hospital bills? (63% replied "yes")

These questions differed only in the use of the words should, could and
might, terms that are often used as synonyms even though they have
different connotations. The 19% difference at the extremes is enough to
alter almost any survey's conclusions. This example illustrates a key
feature of much of the evidence on question wording -- it is extensively
topic bound. Most of the studies dealing with framing questions were so
broad in scope that no single source of bias was given concentrated
attention (Belkin and Lieberman, 1967; Hubbard, 1950). It is difficult to
generalize from the literature to a specific survey situation.
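
The size of the wording effect in Payne's illustration can be checked with
simple arithmetic. The sketch below computes the spread between the extreme
wordings and a two-proportion z statistic for that spread. The sample size
of 500 per wording is purely hypothetical, since the sample sizes are not
reported here, and the function name is introduced only for illustration.

```python
from math import sqrt


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# Proportions replying "yes" under each wording, as quoted from Payne (1951).
yes_rate = {"should": 0.82, "could": 0.77, "might": 0.63}

# Spread between the extreme wordings, in percentage points.
spread = round((yes_rate["should"] - yes_rate["might"]) * 100)

# ASSUMED_N is hypothetical; the sample sizes are not given in the text.
ASSUMED_N = 500
z = two_proportion_z(yes_rate["should"], ASSUMED_N,
                     yes_rate["might"], ASSUMED_N)
print(spread, round(z, 2))
```

Under that assumed sample size the difference would be far larger than
sampling error alone could plausibly produce; with much smaller samples the
same spread would be less conclusive.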

The  literature  review  conducted  for  this  section  uncovered  several 
articles  and  books  purporting  to  offer  "principles  of  question  wording" 
(e.g.,  Payne,  1951;  Roslow  & Blankenship,  1939;  Blankenship,  1942).  Most 
of  the  material  presented,  however,  is  based  on  experience  rather  than 
empirical  research,  and  tends  to  be  more  prescriptive  than  positive,  more 
indicative  than  imperative. 

The stress in the discussion of the literature presented below is on
topics which have been discussed in some detail: positive versus negative
wording of items; objective versus subjective wording of items; and
definite versus indefinite article wording. A section on miscellaneous
studies on questionnaire wording has also been provided to list areas which
have been examined as "one-shot" efforts.

Positive versus negative wording of items. One topic in question
wording which has received considerable attention is statement polarity,
that is, positively versus negatively phrased question stems. Table V-2
summarizes the literature on this topic. It should be noted that all the
studies except Adams (1956) concern question stem wording. Only three
studies (Adams, 1956; Githens, undated; and Waters, 1966) were unable to
find an effect on study results produced by positive versus negative wording.

Eleven studies reported significant effects on a variety of measures, such
as reliability, validity, and suggestibility (Blankenship, 1940a; Burtt &
Gaskill, 1932; Campbell, Siegman & Rees, 1967; Cloud & Vaughn, 1970; Edrich,
1965; Falthzik & Jolson, 1974; Hubbard, 1950; Muscio, 1916; Rugg, 1941;
Rundquist, 1940; and Wembridge & Means, 1918). In general these studies
produced evidence that alternative positive/negative (or neutral) wordings
can produce demonstrable effects on survey results -- a conclusion not
arguing for either form of phrasing but merely recognizing that differences
in results existed.

[Table V-2. Summary of Research on Positive Versus Negative Wording of Items]

Several additional similarities in results among the studies were
present. Both Burtt and Gaskill (1932) and Hubbard (1950) reported that
introducing a negative into the question form increased respondent
suggestibility. That is, there was a tendency for the direction of the
question stem to be chosen in the response alternatives. A potentially
important interviewing variable was pointed out by the comparable
findings of Falthzik and Jolson (1974) and Hubbard (1950). These two studies
illustrated a tendency for statement polarity to be more significant
when personalized (what a person says about himself) than nonpersonalized
(what he says about others or external events). Furthermore, their
results indicated that when a personalized question was changed to a
nonpersonalized version, suggestibility was decreased. Two other studies
cast doubt on the use of negatives in question stems. Muscio (1916), in
assessing the reliability of subjects reporting events they had just
observed, reported that the most reliable question form was a subjective
directed question without negatives. In a study of appropriate question
stems for voting measures, Wembridge & Means (1918) reported that
respondents took greater time and were more confused with negative and
especially double negative (e.g., "minors should not be forced not to
smoke") stems than with simple affirmative versions. These findings are
contradicted by Payne (1951), who reported from several studies that, when
people have strong convictions, the wording of the statement should not
greatly change the stand they take. Rundquist (1940) also suggested that
negative items in a series of personality measures tended to have greater
internal consistency than positively phrased items.

In conclusion, loading by statement polarity choice may be unavoidable
but can cause differences in research results. It can even be
desirable when evaluating policies or objects. But when a particular
phrasing is employed to present a distorted view of opinion or the view
which the researcher thinks is "right," it becomes an evasion of truth,
or the direct opposite of research (Payne, 1951).

Objective versus subjective wording of items. Eight studies were
uncovered relating to the effects of stating question stems in an
objective or subjective direction. A study published by Muscio (1916) is
illustrative of the research in this area. Fifty-six subjects were exposed
to a sequence of pictures and then asked if they saw certain objects in
them. The study's dependent variable was suggestibility, or the degree to
which subjects said "yes" to these objects whether they were present or
not. Muscio concluded that changing from the subjective ("Did you see a
hat in the picture?") to the objective ("Was there a hat in the picture?")
reduced suggestibility. Table V-3 summarizes additional evidence in this
area.


Muscio's evidence regarding objective-subjective direction and
suggestibility has been supported by empirical studies conducted by
Blankenship (1940a), Dohrenwend (1965), and Hubbard (1950). The only
conflicting evidence uncovered in the area was presented by Burtt and
Gaskill (1932), who reported that the objective form showed greater
suggestibility.

TABLE V-3 (cont.)

Fiske (1969): personality inventories; first person vs. third person
wording; 300 airmen. Significant differences between wordings: "What would
others say about you" had higher scale values than self-description.

Concerning response specificity or the avoidance of the "Don't know"
category, three studies with conflicting results were found. Dohrenwend
(1965) noted that the objective version had higher response specificity,
while Burtt and Gaskill (1932) and Blankenship (1940a) concluded that
"don't knows" increased with the objective direction.

The reliability of objectively or subjectively phrased questions
has also been investigated. The only study presenting evidence of higher
reliability for objective question versions was presented by Blankenship
(1940a). Hubbard (1950) and North and Schmid (1960) presented evidence
that subjectively phrased items were more reliable.

The only research studies reporting validity evidence have been
conducted by Blankenship (1940a, 1940c). In both studies the objectively
stated version of a question had higher predictive accuracy.

It  seems  that  follow-up  research  is  warranted  in  this  area.  The 
limited  research  evidence  points  up  more  contradictions  than  similarities 
in  findings.  The  only  area  where  a tentative  conclusion  favoring  objective 
over  subjective  phrasing  can  be  made  is  in  the  area  of  suggestibility. 

Definite versus indefinite article wording. Two studies were found
which reached similar conclusions regarding the use of definite or
indefinite articles in question stems. Indefinite article ("a" or "an")
items are exemplified by the following type of question -- "Did you see a
demonstration of the new night vision device?" A definite article ("the")
item would be worded -- "Did you see the demonstration of the new night
vision device?" Studies by Muscio (1916) and Hubbard (1950) both concluded
that changing from "a" to "the" wordings reduced the level of suggestibility.
The use of indefinite article questions, however, led to increased
reliability of answers when factual or objective information was sought
(Hubbard, 1950). No conclusions in this area can be drawn because of the
limited evidence available.

Miscellaneous  studies  on  question  wording.  The  previous  three  areas, 
positive  versus  negative,  objective  versus  subjective,  and  definite  versus 
indefinite  article  wordings,  have  been  researched  in  a somewhat  systematic 
fashion.  This  section,  however,  is  designed  to  present  selected  highly 
relevant  studies  about  question  wording  which  have  not  been  replicated  by 
other  social  scientists. 

Several isolated studies have dealt with the effect of building into
the question some reference to prominent people. For instance, Cantril
(1940a) compared responses to the following two questions: "Do you approve
of President Roosevelt's sending Sumner Welles to visit European capitals?"
and "Do you approve Sumner Welles's visit to European capitals?" When
Roosevelt's name was used, more people had opinions and more people
disapproved (25% versus 31%). In other studies of the suspected "big name
effect" the results have varied, the big name sometimes making a difference
and sometimes not (Belson, undated b).

Research has also been conducted on the consequences of employing
stereotyped, emotion-charged, or culturally biased words. Significant
changes in the frequencies of positive, negative, or don't know responses
may result (Roslow, Wulfeck & Corby, 1940).

The potpourri of scattered studies on question wording is illustrated
by the following:

1. Hubbard (1950) reported that the incomplete disjunction form, e.g.,
"Was the demonstration interesting, dull or just so-so?" possessed
relatively high suggestibility and relatively low reliability.

2. North and Schmid (1960) examined all possible combinations of
personal-impersonal and qualified-unqualified forms of questions,
and concluded that personal-qualified versions may be the best
according to internal and test-retest reliability and independence
criteria.

3. Steele (1964) compared projective, direct, and indirect questions.
Information from the projective item (interpretations of a picture
drawing) explained more variance in the dependent variable (milk
consumption) than direct and indirect questions.

4. Waters (1966) presented evidence that a subject's reaction to a
forced choice scale was more favorable when some method was
incorporated whereby the subject was given the opportunity to
indicate the degree of applicability of each item to himself.
"Most descriptive of you" and "least descriptive of you" were not
as effective as "the degree to which this applies to you" on a five
point scale.

5. Thumin (1962) experimentally examined buffer items in question
sequencing. Buffer items were defined as neutral items intended
to establish rapport which were placed before "delicate" items.
Study results indicated that the buffer items increased
respondents' admissions of insomnia.

6. In a study involving return of a job satisfaction questionnaire
by 47% of over 1,000 life insurance agents, the effectiveness of
direct and indirect questioning techniques was examined (Weitz
and Nuckols, 1953). Results indicated that the direct and indirect
items intercorrelated significantly, but that the direct items in
general had greater validity (were better predictors of job
survival).

7. Richardson (1960) tested the widespread assumption that
interviewers should not use leading questions. Tape recorded interviews
of seven experienced and 30 untrained interviewers were compared.
Questions were classified into leading and nonleading questions.
A leading question was operationally defined as one which includes,
either explicitly or implicitly, the answer which the interviewer
expected to receive. Study results showed that, contrary to
expectation, experienced interviewers used leading questions in
33% of all their questions, and that leading questions elicited no
more responses containing distorted information than did
nonleading questions.

Conclusions. The literature discussed in the area of question wording
illustrates certain gaps and thinnesses. Many important issues have
been raised by these studies, but there has been little systematic pursuit
of the issues to a conclusion. Seldom have replication studies been
uncovered or attempts made to carry them out in different settings. Many
of the studies have been carried out with small samples or only with
college student subjects. Finally, no research studies were uncovered
which examined the wording of response alternatives.


Clarity  of  Items 

Almost every check list of prescriptives for writing question items
includes instructions such as: "make sure your questions are clear to
the respondent;" "avoid ambiguous, vague, and imprecise questions;" and
"questions must be concrete and specific." Unfortunately, there has been
little systematic research on measures of clarity or the effects of
unclear questions on subjects' responses. In fact, most of the available
literature is based upon authors' experiences, intuition, and "common"
sense (e.g., Jenkins, 1941; Roslow & Blankenship, 1939). This issue has
received some investigation in the area of interviewer ambiguity in
phrasing questions (Hanson & Marks, 1958), a topic not covered here.

This  section  will  present  the  limited  evidence  available  on  how  to 
improve  item  clarity  and  the  effects  of  item  clarity  on  subjects'  re- 
sponses. 

Studies  on  improving  item  clarity.  Several  interesting  studies  have 
been  reported  which  suggested  diverse  tactics  of  improving  item  clarity. 
Gray (1955) suggested that in framing questions which depend on
respondents' memory or recall capabilities, the time period a question covers
must be carefully defined and redefined. The when should be specifically
provided. Litwak (1956) suggested that ad hoc rules on question wording,
such  as  cautions  against  loaded,  vague,  double-barreled  questions,  can  be 
investigated  by  latent  structure  analysis.  Evidence  was  presented  that 
bias  in  questions  may  lie  in  too  many  (ambiguity),  too  few  (clarity),  or 
inappropriate  (clarity)  dimensions.  Thus,  alternative  question  wording 
and  additional  descriptions  may  aid  a subject's  interpretation.  Toops 
(1937) reported that subjects preferred a format where key words or
portions of the question stem were capitalized. The inference was that
an idea of what is required to respond to the question and overall clarity
can be obtained by a glance at the capitalized material. No results were
reported concerning whether underlining key words might accomplish the
same goal.


Several  studies  were  found  which  are  attempts  to  isolate  the  amount 
of  clarity  in  questions.  Speak  (1967)  conducted  a study  whereby  subjects 
who  had  responded  to  questions  in  a personal  interview  were  reinterviewed 
the  following  day  by  another  "in-depth"  session  to  ascertain  what  the 
respondent  had  "really"  meant  and  how  he  interpreted  the  questions.  It 
was  found  that  not  one  question  was  perceived  by  every  subject  as  intended, 
nor  did  one  subject  perceive  all  the  questions  as  intended.  It  appears 
then  that  follow-up  interviews  might  be  purposeful  in  screening  paper  and 
pencil  question  items.  For  example,  Nuckols  (1953)  submitted  poll  questions 
to  respondents  and  then,  after  completion,  asked  them  to  interpret  the 
meaning of the questions. At least 17% of these interpretations were judged
to  be  wholly  or  partially  wrong. 

Another clarity screening method has been offered by Norman (1963b).
He conducted a study of test item content in personality measurement. The
results  indicated  that  there  existed  marked  differences  in  the  validities 
obtainable  from  different  classes  of  test  stimuli,  those  with  the  highest 
degree  of  judged  content  relevance  producing  the  most  satisfactory  results. 
To  the  degree  that  relevance  enhances  the  clarity  of  questions,  this  would 
also  seem  to  be  an  appropriate  pretesting  procedure. 

A technique  called  the  "random  probe"  was  used  to  check  what  closed 
questions actually meant to respondents in a survey in Pakistan (Schuman,
1966). Interviewers were instructed to select randomly 10 items for
further probing. Respondents' understanding was then ranked on a five
point scale. Results indicated that with this particular instrument a
significant minority of the respondents had real difficulty with the
questions.
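
The selection step of the random probe can be illustrated with a short sketch. This is an illustration only, not Schuman's procedure as published: the function names, the item numbering, the seeded generator, and the cutoff used to flag "difficulty" are all assumptions made for the example, and the five point understanding scale is represented simply as the integers 1 to 5.

```python
import random


def select_probe_items(item_ids, n_probes=10, seed=None):
    """Pick n_probes closed items at random for follow-up probing.

    item_ids and seed are hypothetical inputs; the seed is used only so
    that the selection for a given interview can be reproduced.
    """
    items = list(item_ids)
    if n_probes > len(items):
        raise ValueError("cannot probe more items than the instrument has")
    rng = random.Random(seed)
    return sorted(rng.sample(items, n_probes))


def share_with_difficulty(ratings, cutoff=2):
    """Proportion of probed answers rated at or below cutoff (1-5 scale).

    The cutoff is an assumption; the source does not say how "real
    difficulty" was defined.
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(1 for r in ratings if r <= cutoff) / len(ratings)


# Example: a hypothetical 40-item instrument, one interview.
probes = select_probe_items(range(1, 41), n_probes=10, seed=7)
```

Drawing a fresh random subset for each interview spreads the probing burden over all items while keeping each interview short.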

Miklich  (1966)  studied  response  sets  in  relation  to  ambiguously 
worded  statements.  Forty-two  subjects  were  given  statements  with  four 
types of treatment: ambiguous, unambiguous, important, and unimportant.
They were asked whether they agreed or disagreed. The analysis indicated
that  ambiguous  items  did  result  in  more  agreement-disagreement  response 
set.  That  is,  if  the  ambiguous  item  was  important  (not  defined  in  the 
study  writeup)  the  tendency  was  to  agree  with  it,  while  if  unimportant, 
the  tendency  was  to  disagree. 

A large scale study (Belson, undated c) of respondent understanding
of over 2,000 items used in market and social survey questions included
a content analysis of reinterview data with 265 subjects regarding their
understanding of the original questions. Findings related to item clarity
were: if a broad term or concept was used in a question, there was a
strong tendency for respondents to interpret it less broadly; and
respondents who failed to hear some part of a question tended to reconstruct the
question from what they had heard.

Effect  of  item  clarity  on  subjects'  responses.  Few  studies  have  been 
conducted  in  this  area,  perhaps  because  question  clarity  is  itself  such  a 
vague, general concept. One important study was offered by Armstrong and
Overton (1971). Two versions of a questionnaire about intentions to use
a new  transportation  service  were  tested.  One  version  using  a brief 
description  and  one  using  a comprehensive  description  were  successively 
administered.  No  significant  differences  were  found  on  estimates  of 
level  of  demand  at  various  prices,  or  on  the  identity  of  likely  user 
groups.  Thus,  in  some  cases,  additional  verbal  material  in  questions  or 
topic  descriptions  may  not  alter  subjects'  responses. 


Difficulty  of  Items 

One  of  the  first  "laws"  of  questionnaire  development  advanced  by 
almost  every  general  source  on  how  to  write  sound  questionnaires  is  the 
statement  "keep  it  simple."  Logic  dictates  that  words  used  in  surveys 
should  not  have  multiple  meanings,  nor  should  they  be  beyond  the  level  of 
vocabulary  of  the  typical  respondent.  Unfortunately,  this  advice  is  often 
poorly  operationalized. 

This  section  discusses  measures  of  item  difficulty,  and  miscellaneous 
studies  of  survey  instruments.  The  abstracted  literature  on  item  diffi- 
culty is  summarized  in  Table  V-4. 

Measures  of  item  difficulty.  A series  of  studies  have  taken  standard- 
ized tests  or  published  public  opinion  poll  questions  and  subjected  them 
to  a form  of  content  analysis  against  reading  or  vocabulary  difficulty  in- 
dices. Payne (1950a)  found  that  "tightly  worded"  questions  on  an  opinion 
poll  had  Flesch  scores  at  7th  or  8th  grade  level  whereas  "loose"  questions 
with  large  variance  in  reverse  worded  items  scored  at  the  high  school 
level  or  above. 

Similarly,  Nuckols  (1953)  reported  that  nine  published  poll  questions 
had  remarkable  problems  in  wording  difficulty.  In  an  independent  retest, 
17%  of  his  subjects  had  interpretations  of  individual  questions  which  were 
judged  partially  or  wholly  wrong.  Flesch  scores  ranged  from  5.8  to  17.2 
in  reading  grade.  Another  study  (Terris,  1949),  again  a reexamination  of 
poll  questions,  compared  Flesch  and  Dale-Chall  readability  scores  to  Census 
Bureau  Reports  on  formal  school  levels  of  the  U.S.  population.  Study 
results indicated that:

1. 91.6% of all the questions were above the comprehension level
of 12.4% of the population.

2. 73.4% of all the questions were above the comprehension level of
23.2% of the population.

3. 9.8% of all the questions were above the comprehension level of
72.6% of the population.

Difficulty of items has also been assessed with The Teacher's Word
Book of 30,000 Words (Thorndike and Lorge, 1944). Users of this source
state that it is best to err on the side of simplicity if doubt exists.
There are many examples of misunderstandings of what seem to be everyday
words. One study (Roeber, 1948) found that 10% to 20% of the vocabulary
used in seven of the most popular interest inventories was above the 9th
grade reading level as measured by the Thorndike and Lorge word list.

Unfortunately,  no  studies  were  found  concerning  the  comparative  re- 
sults of  reading  level  measures  using  scores  on  the  Flesch,  Dale-Chall, 
Thorndike  and  Lorge,  or  other  readability  scales.  Also,  no  information 
was  uncovered  regarding  the  "fog  level"  reading  difficulty  scoring  system 
used  by  the  U.  S.  Air  Force,  or  Fry's  Readability  Graph  (Fry,  1968). 
However,  a detailed  literature  search  in  these  areas  was  outside  the 
scope  of  this  review. 
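
For a designer who wishes to screen draft items against a readability index during pretesting, the Flesch Reading Ease computation can be sketched as follows. The formula itself (206.835 minus 1.015 times average sentence length minus 84.6 times average syllables per word) is the standard published one; the syllable counter is a crude heuristic assumed here for illustration and will miscount some words, and converting a score to a reading grade requires Flesch's conversion table, which is not reproduced. Function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discount a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    syllables = sum(count_syllables(w) for w in words)
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Two hypothetical item stems, one plain and one heavily worded.
easy_item = "Do you like your job?"
hard_item = ("Considering organizational effectiveness, evaluate "
             "instructional methodology adequacy.")
easy_score = flesch_reading_ease(easy_item)
hard_score = flesch_reading_ease(hard_item)
```

Items whose scores fall well below the rest of the instrument are candidates for rewording before the pretest.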

Hanley (1965) suggested two other measures of item difficulty with
reference to personality testing: response latency and subjective
confidence in accuracy of answer. Either of these measures might also be
employed in pilot or pretest studies of survey instruments.

Miscellaneous  studies  of  survey  instruments.  More  attention  has 
evidently  been  devoted  to  measures  of  item  difficulty  than  to  the  effects 
of item difficulty on questionnaire responses. Hanley (1965) and
Stricker (1963) offer the exceptions. Both have examined the impact of
item difficulty on acquiescence response bias, but with conflicting
conclusions. Stricker determined that acquiescence was more prevalent with
moderate or hard-to-read attitude items, but found the opposite
relationship for personality items. Using the response latency and subjective
confidence  difficulty  measures,  Hanley  (1965)  concluded  that  acquiescence 
occurred  with  difficult,  rather  than  easy,  personality  inventory  material. 
Additional  attention  obviously  needs  to  be  focused  on  item  difficulty 
in  terms  of  response  tendencies  such  as  acquiescence,  "don't  know"  and 
"no"  responses. 

One study (Faerber, 1951) has addressed the important matter of
comparative difficulty of different response alternative formats. In a timed
arithmetic  test,  open  answer,  right-wrong,  multiple  choice,  and  multiple 
choice  with  separate  answer  sheet  formats  were  experimentally  manipulated. 
Results  showed  increasing  difficulty  in  the  order  listed  here. 

Finally, Myers (1962) compared homogeneous to heterogeneous item
difficulty educational tests. No difference in validity coefficients
was found, but tests homogeneous in difficulty were shown to be more
reliable.


Length  of  Question  Stem 


Only  three  studies  were  found  in  the  literature  search  effort  which 
dealt with length of question stems. It should be noted, however, that
the topic of instrument length is discussed in Chapter VIII.

The first study located (Brinkmeier, 1930) was only marginally
relevant to questionnaire construction because it concerned true-false
tests for high school student examinations. A statistical analysis of
6,671 question stems submitted by high school teachers in a national
contest  revealed  that  stems  under  twenty  words  in  length  were  as  often 
true  as  false.  As  length  increased  beyond  twenty  words,  however,  the 
probability  that  the  answer  was  true  increased. 

Marquis, Cannell, & Laurent (1972) examined the impact of question
length and respondent education on self-reports of health information.
The data were later compared to physicians' reports of the same
information for each respondent. Results indicated that longer (interview)
questions  increased  the  accuracy  of  reports  from  those  who  had  finished 
high  school,  and  had  the  opposite  effect  on  those  who  had  not.  Another 
study  (Laurent,  1972),  perhaps  reporting  on  the  same  data  base  as  the  pre- 
vious citation,  offered  evidence  drawn  from  four  experiments  conducted 
with  samples  ranging  from  24  to  200  interviews  for  the  U.  S.  Public  Health 
Service.  Questions  were  altered  in  length  by  adding  redundant,  inconse- 
quential information  in  various  treatments.  It  was  found  that  the  longer 
questions  elicited  more  information  than  the  shorter  questions.  Also, 
after  checking  with  physicians'  reports,  it  was  found  the  longer  questions 
received  more  accurate  answers. 

This  is  apparently  an  underresearched  area.  The  few  isolated  studies 
just  reviewed  concern  only  objective  tests  and  interview  schedule  develop- 
ment. The  conclusion  that  longer  question  stems  (controlling  for  age) 
produce a greater amount of, and more accurate, information cannot be
generalized based upon the limited evidence. More research is warranted in this
area.


Order  of  Question  Stems 

Several  different  sources  of  error  must  be  considered  regarding  the 
general  issue  of  question  stem  effects  in  questionnaire  methodology. 

Order  bias  has  several  meanings  in  survey  research  concerning  question 
stems.  For  example,  if  the  question  were  asked,  "Which  kind  of  weapon  do 
you  prefer,  the  M14  or  the  M16,"  one  might  conjecture  that  a reversing  of 
the  order  of  alternatives  within  the  question  stem  might  be  a source  of 
response error. Literature in this area is discussed in the section of
this  chapter  on  the  Wording  of  items.  Order  bias  in  this  section  refers 
to  the  order  of  questions  within  a series  of  items  designed  to  explore 
the  same  subject  matter,  or  related  subject  matter  areas.  A related  issue 
concerns  the  position  effect  problem  --  the  order  of  different  groups  of 
questions, when the groups deal with essentially unrelated subject matter
areas.

Table V-5 presents a summarization of literature dealing with the
order bias and position effect problems.

Table V-5. Summary of Studies Relating to the Order of Question Stems

Order bias. One of the most typical caveats discussed in the general
literature of how to construct questionnaires is the statement, "vary
(randomly assign) the order of questions on an instrument to avoid one
question contaminating another." Especially prominent are the discussions
of cases where the immediately preceding question or group of questions
places the respondent in a different "mental set" or frame of reference.
For  example,  asking  respondents  a general  question  about  their  feelings 
about  automobile  exhaust  pollution  might  influence  responses  to  a question 
like: "Do you prefer leaded or nonleaded gasoline?" Although this effect
may be prominent in specific, applied settings, little evidence was found
in the literature supporting a general order bias phenomenon in survey
research. Five studies (Baehr, 1953; Blumberg, DeSoto & Kuethe, 1966;
Brenner, 1964; Ferber, 1966; and Lyman, 1949) were unable to document order
biases in investigations in divergent topical areas. Five studies (Cohen,
1965; Gross, 1964; Hofstee, 1966; O'Dell, 1962; and Survey Research Centre,
1972) found that the presentation order of question stems significantly
affected response distributions to given items, nonresponse to items, and
preferences for specific stimuli. Thus, the findings in this area are
inconclusive -- no support exists for the presence of a general order
effect in questionnaire responses. Although the literature that was
reviewed on this topic was sparse, it appears that order effects are a
function of the specific instrument and subjects employed in the
investigation. It is interesting to note that order bias or question sequence may
be  a subtle  issue  in  specific  cases.  More  experiments  testing  the  effect 
of  changing  the  sequence  of  questions  have  been  uncovered  that  show  no 
effect  than  show  significant  differences. 

Position  effect.  Practical  advice  on  how  to  avoid  position  bias 
problems  abounds  in  the  questionnaire  development  literature.  Suggestions 
to  phrase  questions  in  a logical  sequence,  build  rapport  first,  ask  for 
the  basic  information  sought  next,  and  personal  questions  last,  are  illus- 
trative of  the  guidelines  offered  the  questionnaire  designer.  From  the 
literature  review,  it  appears  that  the  extent  of  a general  position  bias 
is  unknown.  This  is  an  area  that  is  poorly  documented.  Four  studies 
(Bradburn  & Mason,  1964;  Cohen,  1965;  Lyman,  1949;  and  Metzner  & Mann, 
1953)  were  unable  to  find  any  effect  of  changing  the  sequence  in  which 
major  sections  of  questionnaires  were  presented.  Conflicting  evidence 
was  offered  by  experimental  results  presented  by  Landon  (1971)  and  the 
Survey  Research  Centre  (1970).  Again,  it  must  be  concluded  that  systema- 
tic research  in  this  area  is  lacking.  As  in  the  previous  case,  however, 
position  bias  may  be  operative  in  specific  research  situations,  but  the 
weight  of  the  evidence  supports  a negligible  influence  of  position  bias 
on  survey  findings. 

Conclusions. The results in the areas of order bias and position
effect cannot be regarded as definitive. In light of the unknowns in these
areas, individual questions and question sections can probably be placed
into whatever appears to be the best psychological or most logical order.


Order of Response Alternatives

One  of  the  principles  of  questionnaire  development  advanced  by  psy- 
chologists is  that  the  responses  to  a particular  proposition  will  be 
influenced  by  the  position  of  the  alternative  in  the  question.  In  the 
literature  of  questionnaire  methodology,  it  is  also  known  as  the  "time 
error" and can occur in questionnaire applications as well as with
laboratory methods. Mathews (1929), in one of the earlier works to recognize
this response pattern, noted in reviewing the results of an experimental
study  that,  although  overall  differences  in  sequencing  existed,  the  first 
of  two  alternatives  in  a question  where  the  order  was  varied  received  more 
endorsements  than  the  second  position.  This  study  also  suggested  that  the 
fourth  (of  five)  response  alternatives  was  chosen  somewhat  more  frequently. 
Mathews'  work  has  received  only  token  empirical  support  with  respect  to 
other  reviewed  literature.  Belson  (1965)  and  Winthrop  (1958)  offered 
evidence  that  reversal  of  verbal  or  numeric  rating  scale  response  alterna- 
tives are  coupled  with  a significant  shift  in  endorsements  toward  the  first 
presented  end  items  or  anchors.  Belson  (1965)  reported  that  a reversal 
from  positive  to  negative  scale  orders  resulted  in  a greater  proportion  of 
choices of negative (or unfavorable) end categories. Winthrop's evidence
suggests, similarly, that reversal of numerical preference alternatives in
natural numbers order (e.g., 1, 2, ..., 5) results in lower scale
reliability.

Two  additional  studies  documenting  an  order  effect  in  response  alter- 
natives were  found.  Becker  (1964)  reported  that  subjects'  choices  of 
their  five  favorite  types  of  radio  and  T.V.  programming  were  influenced 
by  the  ordinal  position  of  the  choices  in  a checklist.  This  study  suggests 
that,  as  an  item  is  listed  close  to  the  end  of  a checklist,  the  probability 
of  its  selection  is  reduced.  Madden  and  Bourdon  (1963)  found  that  revers- 
ing the  order  of  levels  of  job  factors  that  were  presented  to  airmen  for 
evaluation of various jobs resulted in significant differences in job
ratings.

The  studies  discussed  above  must  be  regarded  as  the  exceptions  rather 
than  the  rule  in  this  research  area.  Seven  experimental  studies  (Blumberg, 
DeSoto  & Kuethe,  1966;  Campbell  & Mohr,  1950;  Clark,  1956;  Dyer,  Klein,  & 
Yudowitch, 1975; Feldman, 1969; Kane, 1971; and Symonds, 1936) reported
little  or  no  order  effects  with  response  alternatives.  The  first  study, 
for  example,  experimentally  manipulated  the  "good"  end  of  a graphic  rating 
scale  in  left,  right,  top,  or  bottom  positions  with  minimal  resultant 
effect  on  ratings.  The  analyses  conducted  by  Dyer,  Klein,  and  Yudowitch 
(1975)  concerned  a VOLAR  study  administered  at  Fort  Hood,  Texas,  with 
over  500  military  subjects.  Reversal  of  response  alternatives  was  accom- 
plished by  presenting  one-half  of  the  subjects  with  alternatives  listed 
from most positive to most negative; e.g., "The training I have received
at Fort Hood has been: very challenging, challenging, borderline,
unchallenging, very unchallenging." The remaining subjects received
response alternatives listed from most negative to most positive. This
treatment, used on both attitude and satisfaction scales in the VOLAR
questionnaire, did not produce significant differences on either individual
items or categories of items.

Several problems uncovered by the literature reviewed for this section
preclude arriving at any valid generalization concerning order effects in
response alternatives. First, most of the available published studies
have been conducted with relatively small samples of college student
subjects. Second, the number of studies conducted in this area is limited.
Third,  no  systematic  research  has  been  published  with  respect  to  the  order 
of  response  alternatives  in  specific  types  of  rating  or  scaling  devices, 
such  as  graphic  or  verbal  ratings,  semantic  differential,  or  Likert  scales. 
Fourth,  important  moderating  variables  such  as  subjects'  characteristics, 
topical area, scale length (number of response alternatives), and
instrument length have neither been controlled nor built in as experimental
treatments.
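
The reversal treatment described above for the VOLAR questionnaire, in which half of the subjects receive the alternatives in the opposite order, can be sketched as a split-ballot assignment followed by rescoring to a common scale. This is a minimal illustration under stated assumptions, not the procedure used by Dyer, Klein, and Yudowitch: the alternation by subject index, the function names, and the 1 to 5 scoring are all hypothetical.

```python
# Alternatives for the example item, listed from most positive to most negative.
ALTERNATIVES = [
    "very challenging",
    "challenging",
    "borderline",
    "unchallenging",
    "very unchallenging",
]


def form_for_subject(subject_index):
    """Even-indexed subjects see positive-to-negative order; odd see the reverse.

    Alternating by index is an assumed assignment rule; random assignment
    of subjects to the two forms would serve the same purpose.
    """
    if subject_index % 2 == 0:
        return list(ALTERNATIVES)
    return list(reversed(ALTERNATIVES))


def score_response(subject_index, position_checked):
    """Convert the 0-based position checked on the printed form to a common
    score, where 5 is the most positive alternative and 1 the most negative."""
    label = form_for_subject(subject_index)[position_checked]
    return len(ALTERNATIVES) - ALTERNATIVES.index(label)


# A subject on each form who checks the first printed alternative.
first_on_standard_form = score_response(0, 0)
first_on_reversed_form = score_response(1, 0)
```

Because both forms are rescored to the same scale, mean scores for the two halves of the sample can be compared directly to test for an order effect.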

The reviewed studies on the order of response alternatives are
summarized in Table V-6. Because of the inconclusive nature of the findings
and their contradictions, care probably should be taken to alternate the
order of response alternatives when it appears appropriate to do so. In
this vein, it should be noted that several authors (Ross, 1934; Mosier &
Price, 1945) have developed tables to standardize the order of
presentation of words. Although these tables were designed to provide systematic
variation of paired comparison and multiple choice items, they may be
applicable to verbal rating scales as well. Ross (1934) states that his
method will aid in avoiding "regular" repetition patterns, providing
optimum spacing between identical words, and balancing out fatigue effects.


Chapter VI


NUMBER OF RESPONSE ALTERNATIVES AND
RESPONSE ANCHORING


The effects of variation in the presentation of questionnaire items,
including the order of response alternatives, were discussed in Chapter III.
This chapter considers two related topics: the number of response
alternatives to employ, and response anchoring.

Issues  Regarding  Number  of  Response  Alternatives  to  Employ 

One  of  the  basic  issues  in  the  use  of  any  given  rating  instrument  or 
attitude  scaling  device  is  the  determination  of  the  optimum  number  of 
response alternatives or categories. Researchers' habit or tradition,
rather than solid empirical support, has often led to the recurrent use
of  five  point  Likert  scales,  seven  point  semantic  differential  scales,  and 
so  on.  The  reason  for  concern  with  the  number  of  response  alternatives 
stems  from  the  belief  that  a "coarse"  scale  with  too  few  response  alterna- 
tives may  result  in  a loss  of  information  concerning  subjects'  discrimina- 
tion powers,  or  reduced  cooperation  in  rating  reflecting  a dislike  for 
"forcing"  a judgment.  An  extremely  "fine"  scale,  with  too  many  response 
alternatives,  may  go  beyond  the  subjects'  powers  of  discrimination,  be 
excessively  time  consuming,  or  difficult  to  score. 

The  literature  search  in  the  area  of  number  of  response  alternatives 
was  very  productive.  Over  30  studies  were  found  which  were  directly 
related  to  this  issue.  Table  VI-1  summarizes  the  literature.  The  final 
three columns, headed Reliability, Validity, and Other Findings,
illustrate that multiple criteria have been used in investigating the issue of
the  optimum  number  of  response  alternatives  to  employ.  The  major  criteria 
used  have  been  reliability,  validity,  factors  influencing  subjects'  moti- 
vation and  ability  to  respond,  and  scoring  and  data  analysis  considerations. 

Each  of  these  criteria  will  be  discussed  below. 

Reliability. Numerous studies (Bendig, 1953; Bendig, 1954a; Komorita &
Graham, 1965; Jacoby & Matell, 1971; Masters, 1973; Matell & Jacoby, 1971;
Saunders & Ward, 1964) in the psychometric literature have shown that increasing the
number of response alternatives does not necessarily increase reliability.
These empirical efforts have employed a wide variety of response
alternative combinations as experimental treatments, for example, two through 19
alternatives  inclusively,  and  were  conducted  using  several  types  of  rating 
scales  in  the  context  of  numerous  topical  areas.  It  should  be  noted  that 
all  the  studies  above  except  the  two  Jacoby  and  Matell  efforts  were  con- 
cerned with  internal  consistency  measures,  that  is,  equivalent  forms  or 
split-half  reliability.  Jacoby  and  Matell  (1971)  examined  both  internal 
and  temporal  (test-retest)  consistency  and  concluded  that  both  measures 
were  independent  of  the  number  of  response  alternatives. 
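
For readers unfamiliar with the internal consistency measure mentioned above, a split-half reliability estimate can be sketched as follows. This is a generic textbook computation, not the analysis used in any of the cited studies: the odd-even split, the function names, and the toy data are assumptions made for illustration. The half-test correlation is stepped up to full length with the Spearman-Brown formula, 2r / (1 + r).

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a half has no variance")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per subject (same items, same order).

    Items are split into odd- and even-numbered halves, an assumed but
    common choice; other splits give somewhat different estimates.
    """
    odd_totals = [sum(r[0::2]) for r in responses]
    even_totals = [sum(r[1::2]) for r in responses]
    return spearman_brown(pearson(odd_totals, even_totals))


# Toy data: four subjects, four items, perfectly consistent answers.
toy_responses = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [5, 5, 5, 5]]
toy_reliability = split_half_reliability(toy_responses)
```

Computing this estimate separately for each scale format (two point, five point, and so on) is one way to compare reliability across numbers of response alternatives.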


Table VI-1 (continued). Saunders & Ward (1964): 2 vs. multiple choice;
bipolar scales/personality; 282/college. Reliability: no difference in
reliability. Other findings: efficiency, no difference in the proportion
of positive responses.

Several examples of studies exist in the literature which indicate
that  a nonlinear  relationship  exists  between  the  range  of  response  alter- 
natives and  the  magnitude  of  the  coefficient  of  reliability.  For  example, 
in separate investigations Bendig (1953, 1954a, 1954c) found:
reliability was equal for three, five, seven and nine point scales, but lower for
11 alternatives; rater reliability (summed ratings for each object rated)
was  constant  from  five  to  nine  alternatives  but  slightly  higher  at  three 
and  slightly  lower  at  two  categories;  and  rater  reliability  was  highest 
with  four  alternatives  and  lowest  with  two  alternatives.  Neidt  and  Merrill 
(1951)  reported  that  the  reliability  of  a five  point  rating  scale  was  slightly 
higher than a two alternative, positive-negative, format. In a
personality assessment study, Symonds (1924) contended that fewer than seven
scale alternatives resulted in a loss of reliability, but employing
greater than seven did not improve reliability.

The  above  studies  seem  to  suggest  that  there  is  an  optimal  number  of 
response  alternatives  to  employ  for  any  given  investigation  situation, 
including the topic area, characteristics of the subjects, etc. For
example, in other studies results were dependent on the type of rating
instrument used. Komorita and Graham (1965) found that increasing the number of
response alternatives improved reliability for heterogeneous scales with
dissimilar item content, but had no effect on homogeneous scales. Masters (1973)
demonstrated that the reliability of a traditionalism of education scale
was independent of the number of response alternatives, but in a
progressivism scale reliability increased from two to three alternatives.

Validity. Jacoby and Matell (1971) pointed out that most of the
psychometric literature dealing with the number of alternatives issue
emphasized reliability as the major, and in some cases, only criterion
in the choice of the number of scale points. They felt, however, that
the ultimate criterion is the effect a change in the number of scale points
has on the validity of the scale. Table VI-1 illustrates that only two
original studies addressed the validity question. Neidt & Merrill (1951) in an
attitude toward education investigation reported no difference in
concurrent validity between two and five alternative rating scales. This
study  reported  mean  course  marks  and  scale  scores,  holding  constant  ACE 
scores  and  hours  studied  per  week. 

The  only  study  examining  both  concurrent  validity,  with  attitudes 
and  behavior  measured  at  one  point  in  time,  and  predictive  validity  (cor- 
relation of  observed  behavior  with  that  which  was  predicted  from  attitude 
measures)  was  Jacoby  and  Matell  (1971).  In  relation  to  both  measures, 
the  authors  concluded  that  no  consistent  relationship  existed  between 
either  measure  and  number  of  response  alternatives  employed. 

Although the evidence is consistent, the lack of numerous studies
using divergent types of subjects, instruments and topics makes it
difficult to reach a conclusion regarding the effect of the number of response
alternatives on validity.

Factors influencing subjects' motivation to respond and efficiency
of response. This section addresses a series of related matters:
subjects' preferences for and ability to use scales with a varied number
of response alternatives. Direct and indirect measures of subject
preferences and motivation have been examined. Direct measures, preference
ratings, were used in studies conducted by Matell (1970) and Strahan
(1971). However, both studies used college students as subjects.
Experimental results were consistent: college students reported a preference
for using finer scales. In Matell's investigation, scales of nine to
thirteen  alternatives  were  preferred,  and  Strahan  reported  significantly 
higher  preferences  for  using  "several"  alternatives  over  a true-false 
format. 

Indirect  or  proxy  measures  of  subject  preferences  and  motivation 
include  response  time  and  number  of  "uncertain"  and  "no  responses."  Matell 
(1970) presented evidence that no difference in total time for
administration was shown for two through nineteen alternatives. Bevan and
Avant (1968) and Matell and Jacoby (1972), however, reported that testing
time increased as a direct function of the number of alternatives, thus
supporting the more intuitively plausible findings. Concerning the
relationship between the use of "uncertain" response categories or "no
responses" and scale length, the literature supported the conclusion that
increasing the number of response alternatives decreased uncertain and
non-responses to scale items (Matell, 1970; Hughes, 1969; Matell & Jacoby,
1972; Ghiselli, 1939; Dunette, Alyward & Uphoff, 1956; Tsudzuki, 1953;
Zuckerman, 1952). For example, Tsudzuki (1953) studied the nature of
non-response
in  a two  category  (yes-no)  questionnaire.  This  was  done  by  administering 
the  same  test  to  the  same  group  with  additional  categories  such  as  "in- 
between,"  "cannot  decide,"  and  with  two  different  intensities  of  "agree" 
and  "disagree."  The  latter  method  significantly  reduced  the  percentage  of 
non-response. Ghiselli (1939) noted that respondents given yes-no
responses generally rated a product's advertising as less sincere than
those given a four-step scale. He felt that people were more willing to respond to
a four-step  scale.  Hughes  (1969)  concluded  that  the  use  of  forced  choice 
scales  results  in  a confounding  of  indifference  and  awareness. 

Efficiency of response, or the ability to use scale points in
discriminating among objects and/or attributes, and response style (such as
yeasaying)  have  also  been  examined  in  relation  to  the  number  of  response 
alternatives.  Matell  and  Jacoby  (1972)  determined  that  the  proportion  of 
scale  used  was  independent  of  the  number  of  response  alternatives.  Several 
studies  suggest  that  yeasaying  tendencies,  as  measured  by  the  proportion 
of  positive  responses,  are  unaffected  by  scale  length  (Goldsamt,  1972; 
Saunders & Ward, 1964; Tuckman & Lorge, 1953).

The literature reviewed in this section is mostly subject to the
limitation of being drawn from college student populations. Subjects
of different educational, occupational, and age levels may be less
predisposed to respond to fine scales.

Scoring and data analysis considerations. Scoring and data analysis
considerations may affect the selection of the number of response
alternatives to be used in any given study. Several studies compared
dichotomous/trichotomous scoring methods to the normal summated scoring
procedures and reached the conclusion that results (differences in
attitudes) were not affected by the method of scoring (Matell, 1970; Matell
and Jacoby, 1971; Goldsamt, 1972). However, for these specific
investigation situations two or three response alternatives might have been
the optimal number to employ.
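
As a rough illustration of the two scoring approaches compared above, the
following sketch scores one respondent's five-point Likert responses both
by the usual summated procedure and by a dichotomous (favorable versus not
favorable) count. The 1 to 5 coding, the reverse keying of negatively
worded items, and the function names are assumptions made for the example;
they are not taken from the studies cited.

```python
def keyed(response, reverse, n_points=5):
    """Return the response with negatively worded items reversed."""
    return (n_points + 1 - response) if reverse else response

def summated_score(responses, reverse_keyed=(), n_points=5):
    """Sum of keyed item responses (the usual Likert total)."""
    return sum(
        keyed(r, i in reverse_keyed, n_points) for i, r in enumerate(responses)
    )

def dichotomous_score(responses, reverse_keyed=(), n_points=5):
    """Number of items answered on the favorable side of the midpoint."""
    midpoint = (n_points + 1) / 2
    return sum(
        1
        for i, r in enumerate(responses)
        if keyed(r, i in reverse_keyed, n_points) > midpoint
    )

# Hypothetical respondent: item at index 3 is negatively worded.
answers = [5, 4, 2, 1, 3]
print(summated_score(answers, reverse_keyed={3}))
print(dichotomous_score(answers, reverse_keyed={3}))
```

The dichotomous score discards the intensity of each response, which is
why agreement between the two methods is an empirical question.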

Several problems with a two or three way scoring procedure exist
which are statistical in nature. If Chi Square tests are sufficient,
two or three categories might be adequate. However, if nonparametric
rank order correlations are to be employed, substantial "ties" on ranks
will result. Also, if parametric statistics are to be employed, the more
alternatives the better, because of the assumption of continuous
distributions or interval scale properties. Finally, another analytical
issue of concern with the use of two or three point scales is the
reproducibility of the original data configuration, an issue important in
the use of multidimensional scaling. Using simulated data, Green and Rao
(1970) demonstrated that recovery is poor with two or three alternatives,
and that diminishing returns set in rapidly beyond eight alternatives.
Other considerations related to scoring questionnaires are discussed in
Chapter XI.
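
The point about ties on ranks can be made concrete with a small sketch.
It takes twenty hypothetical respondents spread evenly along an attitude
continuum, records each on a scale of k categories, and reports the
fraction of respondent pairs that end up tied. The evenly spread
respondents and the equal-width categories are simplifying assumptions for
illustration only.

```python
from itertools import combinations

def coarsen(value, k):
    """Map a latent value in [0, 1) onto one of k equal-width categories, coded 1..k."""
    return min(int(value * k), k - 1) + 1

def tied_pair_fraction(ratings):
    """Fraction of respondent pairs that share the same rating (a tie on ranks)."""
    pairs = list(combinations(ratings, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Twenty hypothetical respondents spread evenly along the attitude continuum.
latent = [i / 20 + 0.025 for i in range(20)]

for k in (2, 3, 5, 7, 10):
    ratings = [coarsen(v, k) for v in latent]
    print(k, round(tied_pair_fraction(ratings), 3))
```

With two categories nearly half of all pairs are tied, and the fraction
falls as categories are added, which is the difficulty for rank order
correlations noted above.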

Summary and Conclusions. The state of the literature was probably
best summarized by Ghiselli and Brown in Personnel and Industrial Psychology
(1948) and by Guilford in Psychometric Methods (1954). These authorities
contend that the optimal number of response alternatives is a matter for
empirical determination in any situation, and suggest that there is a wide
range of variation in refinement around which the optimal point in
reliability changes very little. It would appear, however, that additional
research in the area might be warranted covering: the different types of
rating scales; various topical areas of research; and subjects with
different ability, educational, and sociodemographic characteristics. From
such studies more information would be available regarding the optimal
number of response alternatives to employ for any specific investigation
situation.


Response  Anchoring 

This section contains a summary of research findings concerning
response anchoring, including: types of response anchors; anchored versus
unanchored scales; amount of verbal anchoring; selection procedures for
verbal scale anchors; and balanced versus unbalanced scales.

Types  of  response  anchors.  The  researcher's  judgment  has  typically 
determined  whether  response  anchors  are  to  be  verbal,  numerical,  graphic, 
or some combination. In its original form, the semantic differential was
in the following graphic form (Osgood, Suci & Tannenbaum, 1957).
Respondents were instructed to place an X on the line that represented
their attitude.


Strong  : : : : : : : Weak 

This  represents  a combination  graphic  and  verbal  scale.  The  Likert 
method  calls  for  a verbal  rating  (strongly  agree  through  strongly  disagree) 
to a directional statement phrased either positively or negatively. For
example:

The Modern Volunteer Army (MVA) places too much emphasis on extrinsic
factors (e.g., beer in barracks) as opposed to intrinsic, job
related factors (e.g., pay, supervision).

__Agree Strongly  __Agree  __Undecided  __Disagree  __Disagree Strongly

Few studies were uncovered in the literature review which
systematically investigated the effect of type of response alternative
employed. In a study to validate consumer attitudes concerning various
brands against actual purchases, Abrams (1966) tested four combinations of
rating devices:

1. Verbal anchors with a -5 through +5 numerical continuum, e.g.:

   Definitely                                          Definitely
   Dislike                                             Like

   -5   -4   -3   -2   -1    0   +1   +2   +3   +4   +5

2. Verbal anchors with a 1 through 10 numerical continuum, e.g.:

   Definitely                                     Definitely
   Dislike                                        Like

   1    2    3    4    5    6    7    8    9    10

3. A verbal and numerical continuum, e.g.:

   Dislike      Dislike    Dislike    Neither    Like       Like       Like
   Completely   Somewhat   a Little   Like nor   a Little   Somewhat   Completely
                                      Dislike

   1            2          3          4          5          6          7

4. Verbal continua, e.g.:

   Below     About     A Little   A Lot    One of     None
   Average   Average   Better     Better   the Best   Better

Experimental findings illustrated that: average scale scores are
relatively constant regardless of scale type; scale 4 had a lower average
prediction error (the differences between predicted brand share and actual
consumer purchases); and scale 4 had a far smaller amount of clustering of
responses at the extreme positive position. These findings also support
the convention of researchers who do not include numerical response
alternatives in an attitude measurement scale.
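
The average prediction error described above can be sketched as the mean
absolute difference between predicted and actual brand shares. Whether
Abrams averaged absolute or signed differences is not stated here, so the
absolute form is an assumption, and the brand names and shares below are
invented for illustration.

```python
def average_prediction_error(predicted_share, actual_share):
    """Mean absolute difference between predicted and actual brand shares."""
    return sum(
        abs(predicted_share[brand] - actual_share[brand])
        for brand in predicted_share
    ) / len(predicted_share)

# Hypothetical shares: predicted from scale scores, actual from purchases.
predicted = {"Brand A": 0.40, "Brand B": 0.35, "Brand C": 0.25}
actual = {"Brand A": 0.30, "Brand B": 0.40, "Brand C": 0.30}
print(round(average_prediction_error(predicted, actual), 3))
```

A rating device with a smaller value on this measure tracks purchase
behavior more closely.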


Several additional studies were found which support the use of verbal
anchoring and verbally defined response alternatives. In a study employing
Air Force personnel as subjects (Madden, 1964), three forms of rating
scales were used: (1) each scale alternative was verbally defined and
illustrated; (2) neither definitions nor examples were used (numerical
scale); and (3) definitions were used but examples were eliminated
(verbal scale). Forms (1) and (3) were equally reliable and of greater
reliability than form (2). Form (3) was preferred because it was simpler
and less time consuming for raters to use. Peters and McCormick (1966)
compared the effectiveness of job-task anchored (verbal) equal appearing
intervals scales and simple numerically anchored scales. The job-task
anchored scales were found to have significantly greater reliability.

Marsh  and  Perrin  (1925)  compared  the  effectiveness  of  the  graphic 
scale,  percentage  scale,  and  man-to-man  scale.  On  the  graphic  scale, 
raters  underscored  the  description  most  applicable  to  the  subject.  On 
the  percentage  scale  the  raters  placed  a check  mark  in  the  column  repre- 
senting the  subject's  standing,  in  terms  of  the  perceived  amount  of  a 
given  trait  possessed,  with  reference  to  a preliminary  group  of  subjects. 
With  the  man-to-man  scale  the  subjects  were  compared  with  particular  in- 
dividuals representing  the  standards  for  the  traits  rated.  The  results 
failed  to  demonstrate  the  superiority  of  any  one  form  of  scale,  with  the 
range  of  average  deviations  from  agreement  being  extremely  limited  regard- 
less of  the  form  of  scale  employed.  Ross  (1966)  compared  man-to-man  job 
performance  ratings  with  ratings  from  an  anchored  rating  scale  for  their 
validity  in  guiding  salary  decisions  in  a research  and  development  organ- 
ization. The  man-to-man  comparison  procedure  was  found  to  be  as  valid 
as  the  anchored  ratings.  However,  the  two  methods  diverged  in  important 
practical  ways  in  the  results  they  produced. 

Two  other  studies  reported  the  favorability  of  using  verbal  scale 
anchors,  although  neither  compared  verbal  to  other  types  of  anchors.  In 
a study of supervisory style of head nurses, Smith and Kendall (1963)
anchored  evaluative  rating  scales  with  examples  of  expected  supervisory 
behavior.  The  examples  were  selected  by  independent  consensus  of  a number 
of  head  nurses.  Scale  reliabilities  ranged  above  .97. 


Barrett,  Taylor,  Parker  and  Martins  (1958)  administered  four  rating 
scale  formats  varying  from  unstructured  to  highly  structured  in  nature. 
Second  line  supervisors  rated  clerical  workers.  The  verbal  format  employ- 
ing trait  titles  and  behavioral  descriptions  of  scale  steps  demonstrated 
higher  inter-rater  reliability,  less  halo,  and  less  leniency  than  did  the 
more  structured  or  less  structured  formats.  It  should  be  noted  that  this 
study  concerned  the  amount  of  verbal  cues  along  a scale. 

Based  upon  the  studies  reviewed  in  this  section,  it  appears  that 
empirical  support  exists  to  conclude  that  the  reliability  of  scales  with 
verbal  anchors  and  verbal  response  alternatives  is  superior  to  that  of 
numerical  and  other  combinations  of  verbal  and  numerical  scales.  Little 
evidence  was  provided  regarding  graphic  rating  techniques.  It  should  be 
noted that none of the studies addressed the issue of comparative validity
or subjects' preferences and/or ability to use the rating instrument.

Anchored versus unanchored scales. A number of studies have been
conducted on the topic known as "anchoring effects." This section will
focus on differences in research results caused by the use of anchored
versus unanchored scales. It should be noted that all the studies
which follow compared unanchored scales to scales with one end anchored.
Many of the studies varied the anchoring of the left or right end of the
scale.

An  early  study  in  anchoring  effects  in  the  judgment  of  verbal  mater- 
ials was  reported  by  McGarvey  (1943).  Subjects  were  asked  to  scale  state- 
ments about  the  social  prestige  of  occupations  and  undesirable  forms  of 
behavior.  Scales  used  in  the  experiment  were  either  unanchored  or  had 
either  one  of  the  two  extreme  points  verbally  anchored.  Results  indicated 
that in unanchored scales the absolute scale tended to be "stimulus
anchored," i.e., anchored by the question stem. With either of the end
points anchored, the tendency was to move from the stimulus value on the
absolute scale toward the anchored extreme. This anchoring effect has been
confirmed by Rogers (1941) and Volkman (1936). Rogers' study also examined
confidence in ratings and judgment time. Confidence was only slightly
affected by anchoring, but was found to decrease in the higher categories
nearer the anchor. Judgment time decreased with anchoring. In a
reexamination of Volkman's experiment, Hunt and Volkman (1937) were unable
to conclude
with  certainty  that  an  anchor  effect  existed.  An  incomplete  shift  in  scale 
values  occurred.  That  is,  the  movement  did  not  include  the  subject's  own 
stimulus  anchor,  his  most  pleasant  color. 

Several studies conflict with the anchor effect investigations
described above. For example, Weiss (1961) used two separate
experimental  groups  in  a study  concerning  attitude  toward  delinquents; 
the  experiments  cited  above  used  the  same  subjects  with  an  anchored  scale 
following  an  unanchored  rating.  One  group  was  given  an  extreme  statement 
as  a standard  for  a punitive  category,  and  the  other  was  given  no  stand- 
ard. A contrast  effect,  movement  in  ratings  away  from  the  anchor  state- 
ment, was  produced  by  the  extreme  standard.  Hunt  (1941)  offered  evidence 
supportive of this contrast effect. In concluding this experimental
study, he commented: "if judgments made with an unanchored scale be
repeated with the scale anchored by the further definition of one extreme,
there is a shift in the average value of the stimulus judgments, and this
shift is in a direction away from the anchoring value." This would mean
that when a scale is anchored at its low or negative extreme, the ratings
tend to rise or be more positive, and vice versa.

Frawley (1943) had 115 seminarians rate 100 statements about war,
belief in God, birth control, etc. His experimental procedure was similar
to the previously cited studies: subjects rated the statements first
on an unanchored scale and the next day rated them again on a scale with
the most unfavorable end of the scale anchored. The fact that the
Spearman rank order correlations were extremely high for the two sets of
data indicated minimal presence of anchor effects.

Because of this conflicting evidence it cannot be concluded that
use of a single verbal anchor produces anchor effects, contrast effects,
or indeed any significant differences in rating scale average scores.
Most of the studies seem limited due to the use of the same subjects in
simple before (unanchored) and after (anchored) experimental designs. It
must also be noted that the investigations cited in this area were
theoretical inquiries into common principles or "laws" governing ratings or
judgments, and were less concerned with strengths and weaknesses of types
of scales. Finally, the fact that not one of the investigations cited
employed verbal anchors at both ends of the scale makes it even more
difficult to conclude whether anchored or unanchored scales should be
employed.
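
Frawley's comparison of unanchored and anchored ratings rests on the
Spearman rank order correlation. A minimal sketch of that statistic
follows, computed as the product-moment correlation of average ranks so
that tied ratings are handled. The ratings shown are invented for
illustration and are not Frawley's data, and the function names are
hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank order correlation between two sets of ratings."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scale values for five statements under each condition.
unanchored = [2, 5, 7, 9, 11]
anchored = [1, 4, 6, 8, 10]
print(spearman_rho(unanchored, anchored))
```

Because the coefficient depends only on rank order, it remains high even
when every rating shifts by the same amount, as in the hypothetical data
above.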

Amount  of  verbal  anchoring.  A few  research  studies  have  addressed 
the  issue  of  the  effects  of  varying  the  amount  of  verbal  anchoring  on 
rating scales. Bendig (1953) examined the impact of amount of verbal
anchoring  in  a study  where  225  college  students  rated  themselves  as  to 
how  much  they  knew  about  12  foreign  countries.  Alternative  scales  presen- 
ted to  subjects  had  the  center  category  defined,  the  end  categories  de- 
fined, or  both  center  and  end  categories  defined.  Results  indicated  that 
the  reliability  of  the  scales  increased  with  added  scale  anchoring.  In  a 
separate  report  of  the  same  data  base,  Bendig  and  Hughes  (1953)  concluded 
that  increased  verbal  anchoring  also  resulted  in  a slight  increase  in  the 
amount  of  information  transmitted  by  the  scale.  The  use  of  "information 
transmitted"  here  is  in  the  context  of  information  theory,  meaning  more 
descriptive  data  from  the  respondents. 
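
For readers unfamiliar with the term, the sketch below computes
transmitted information in bits from a table of stimulus by response
counts, using the usual information theory definition T = H(stimulus) +
H(response) - H(stimulus, response). Whether Bendig and Hughes computed
the quantity in exactly this way is an assumption, and the counts are
invented for illustration.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_transmitted(table):
    """Transmitted information for table[i][j] = responses j given to stimulus i."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cells = [c for row in table for c in row]
    return entropy(row_totals) + entropy(col_totals) - entropy(cells)

# Hypothetical counts: two stimuli rated on a two-category scale.
perfect = [[10, 0], [0, 10]]   # each stimulus always receives its own category
useless = [[5, 5], [5, 5]]     # responses unrelated to the stimulus
print(information_transmitted(perfect), information_transmitted(useless))
```

A scale on which responses discriminate cleanly among the objects rated
transmits more information than one on which responses are unrelated to
the objects.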

Abrams  (1966)  compared  seven  point  scales  with  only  the  end  categor- 
ies defined  to  scales  with  the  entire  continuum  verbally  anchored.  The 
study examined consumer mail panel respondents' attitudes toward national
brands  of  toothpaste  and  scouring  cleanser.  Shelf  inventories  of  actual 
purchases  of  products  in  these  categories  were  conducted  in  a follow-up. 
Scales  which  had  verbal  descriptors  for  all  response  alternatives  were 
better  predictors  of  purchase  behavior.  These  scales  also  displayed 
greater  respondent  use  of  the  range  of  response  alternatives,  with  less 
clustering  at  the  extreme  positive  position. 

Another study in support of increased verbal anchoring was offered
by Madden (1964). Four job evaluation factors were used as the basis of
rating 10 Air Force specialties. For each factor three different methods
of anchoring were used: (1) each response alternative was defined and
illustrated; (2) neither definitions nor examples were used; and (3)
definitions were used but examples were omitted. Methods (1) and (3) were
approximately equal in reliability, both yielding more reliable scales
than Method (2).

Only one empirical study was uncovered which offers evidence somewhat
in conflict with the above findings. Carter, Ruggels, and Chaffee
(1968) conducted an experiment in which 15 semantic differential scales
were used to describe 12 objects (concepts about schools). One hundred and
thirty-five female teachers were given the opportunity to modify the scales
during rating. Assuming that not every adjective scale was a useful
descriptor, subjects were given one polar adjective for each scale and were
asked to fill in the appropriate opposite or note "wouldn't use scale" or
"don't know." The authors concluded that for four of the 15 scales, the
polar
opposite  chosen  by  the  subjects  was  not  suggested  by  Osgood.  Also,  the 
authors  noted  that  whatever  the  merits  of  anchoring  both  ends  of  every 
scale  to  measure  meaning,  it  appeared  that  subjects  can  more  accurately 
devote  their  descriptions  to  objects  when  one  end  of  the  scale  is  left 
for  them  to  describe.  Perhaps  this  study  raises  more  doubts  about  the 
validity  of  the  semantic  differential  technique  than  it  offers  concrete 
evidence  regarding  the  amount  of  verbal  anchoring. 

In  conclusion,  the  limited  number  of  research  studies  cited  above 
are  somewhat  consistent  in  reporting  greater  scale  reliability  with  added 
verbal  anchoring.  Also,  one  experiment  (Abrams,  1966)  offered  evidence 
of  higher  predictive  validity  with  more  verbal  anchoring. 

Selection  procedures  for  verbal  scale  anchors.  This  section  presents 
literature  which  dealt  with  appropriate  procedures  for  the  selection  of 
verbal  scale  anchors.  The  studies  relate  to  verbal  anchors  used  in 
Likert  scales,  semantic  differential  scales,  and  rating  scales. 


A complete bipolar adjective screening methodology for semantic
differential scales has been outlined by Lusk (1973). This procedure seems
highly applicable to any research considering the use of the semantic
differential. The process suggested is as follows:

1.  Select  from  Osgood's  Thesaurus  Study  a set  of  bipolar  adjectives 
for  each  factor  dimension,  evaluative,  potency,  and  activity,  appli- 
cable to  the  study. 

2.  Select  a pretest  sample  representative  of  the  final  population  in 
the  study. 

3.  Prepare  the  pretest  concept/adjective  test  blocks,  randomizing  the 
concepts  and  bipolar  adjectives. 

4.  Administer  the  pretest  semantic  differential  scales  and  compare 
variances  (test  for  differences)  from  the  scale  midpoint.  The 
objective here is to eliminate a preponderance of midinterval
responses for each bipolar adjective for each concept evaluated.

5.  Order  the  bipolar  adjectives  based  upon  their  respective  vari- 
ances, high  to  low. 

6. Select the required number of bipolar adjectives; those sets
with significantly lower variances (F test) from midpoints may be
eliminated.
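
Steps 4 through 6 above can be sketched roughly as follows. The sketch
computes each adjective pair's variance about the scale midpoint, orders
the pairs from high to low, and flags pairs whose variance is much lower
than that of the highest pair. The seven-point scale, the pretest ratings,
and the critical F value of 3.0 are placeholders chosen for illustration;
in practice the critical value would come from an F table for the pretest
degrees of freedom. This is an interpretation of the outline, not Lusk's
own computation.

```python
def midpoint_variance(ratings, midpoint=4):
    """Mean squared deviation of ratings from the scale midpoint."""
    return sum((r - midpoint) ** 2 for r in ratings) / len(ratings)

def screen_adjectives(pretest, midpoint=4, f_critical=3.0):
    """Order adjective pairs by variance about the midpoint (high to low).

    Returns (name, variance, f_ratio, keep) tuples, where f_ratio compares
    the highest variance to the pair's own and keep is False when the
    ratio exceeds the placeholder critical value.
    """
    scored = sorted(
        ((midpoint_variance(r, midpoint), name) for name, r in pretest.items()),
        reverse=True,
    )
    top_variance = scored[0][0]
    results = []
    for variance, name in scored:
        f_ratio = float("inf") if variance == 0 else top_variance / variance
        results.append((name, variance, f_ratio, f_ratio <= f_critical))
    return results

# Hypothetical pretest ratings of one concept on a seven-point scale.
pretest = {
    "good-bad": [1, 7, 2, 6, 1, 7],
    "strong-weak": [3, 5, 4, 4, 3, 5],
    "active-passive": [2, 6, 3, 5, 2, 6],
}
for row in screen_adjectives(pretest):
    print(row)
```

Pairs that draw mostly midinterval responses show little variance about
the midpoint and fall to the bottom of the ordering.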

One  additional  insight  into  the  selection  of  bipolar  adjectives 
should  be  mentioned.  Carter,  Ruggels,  and  Chaffee  (1968)  reported  that 
bipolar  adjectives  such  as  sweet-ferocious  were  of  little  value  in  rating 
inanimate objects, such as: "Are boulders sweet-sour?" But they were
useful with relational concepts. Again, caution must be exercised in
selecting bipolar adjectives or phrases.


Smith  and  Kendall  (1963)  tested  a procedure  for  the  construction  of 
evaluative  rating  scales  anchored  by  examples  of  expected  behavior. 
Expectations,  based  on  having  observed  similar  behavior,  were  used  to 
permit  rating  in  a variety  of  situations.  Examples,  submitted  by  head 
nurses  as  illustrations  of  nurses’  behavior  related  to  a given  dimension, 
were  retained  only  if  reallocated  to  that  dimension  by  other  head  nurses. 
They  were  then  scaled  as  to  desirability.  Agreement  for  a number  of 
examples  was  high,  and  scale  reliabilities  ranged  above  .97. 

In  general,  the  interpretation  of  the  above  studies  is  that  pretests 
for  the  selection  of  verbal  anchors  are  valuable  in  building  scale  content 
validity  and  reliability.  Rather  than  employing  anchors  which  seem  appro- 
priate to  the  researcher,  anchors  should  preferably  be  selected  by  respon- 
dents similar  to  those  who  will  be  participating  in  the  study. 

Balanced  versus  unbalanced  scales.  Historically,  the  balanced  scale 
has  been  preferred  by  researchers.  A scale  is  balanced  when  it  has  an 
equal number of response alternatives on either side of the scale's
"indifferent" category. For example, the following verbal scale is
balanced:

How  would  you  describe  the  Volunteer  Army? 

Very  Progressive 

Progressive 

Moderately  Progressive 

Neither  Progressive  nor  Conservative 

Moderately  Conservative 

Conservative 

Very  Conservative 

Unbalanced  scales  have  been  employed  when  pretest  results  demonstrated 
that  subjects,  by  using  extreme  response  alternatives  on  a scale,  produced 
a skewed  distribution  of  responses  rather  than  the  statistically  desirable 
normal  distribution  around  the  mean  attitude.  To  minimize  "end  piling," 
unbalanced  scales  have  been  used.  More  response  alternatives  are  added  to 
the  end  of  the  scale  where  the  piling  is  likely  to  occur.  This  practice 
tends to shift the distribution of responses along the scale continuum.
For example, the following scale is heavily unbalanced on the favorable
end:


What  is  your  reaction  to  the  "beer  in  the  barracks"  policy? 

Enthusiastic 

Extremely  Favorable 

Very  Favorable 

Favorable 

Fair 

Poor 
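
The definition of balance given above can be checked mechanically. In the
sketch below each response alternative is coded +1, 0, or -1 according to
the side of the indifferent category on which it falls, and the scale is
balanced when the two sides have equal counts. The codings are the
editor's assumptions for the two example scales (in particular, "Fair" is
treated as the indifferent category), not part of the original examples.

```python
def balance(scale):
    """Alternatives on the positive side minus alternatives on the negative side."""
    positive = sum(1 for _, side in scale if side > 0)
    negative = sum(1 for _, side in scale if side < 0)
    return positive - negative

def is_balanced(scale):
    """True when both sides of the indifferent category have equal counts."""
    return balance(scale) == 0

volunteer_army = [
    ("Very Progressive", +1),
    ("Progressive", +1),
    ("Moderately Progressive", +1),
    ("Neither Progressive nor Conservative", 0),
    ("Moderately Conservative", -1),
    ("Conservative", -1),
    ("Very Conservative", -1),
]
beer_in_barracks = [
    ("Enthusiastic", +1),
    ("Extremely Favorable", +1),
    ("Very Favorable", +1),
    ("Favorable", +1),
    ("Fair", 0),   # assumed to serve as the indifferent category
    ("Poor", -1),
]
print(is_balanced(volunteer_army), balance(beer_in_barracks))
```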

Only  one  empirical  study  was  found  in  the  literature  review  which 
dealt  with  the  comparative  effects  of  balanced  versus  unbalanced  scales. 
In the study (Weiss, 1963b), 350 college students judged the social
prestige of 400 occupations. Four types of scales were examined, each
with a zero category designating average prestige. They were:

1. Three category balanced with an equal number of plus and minus
categories.

2. Seven category balanced with an equal number of plus and minus
categories.

3.  Five  category  unbalanced  with  a single  plus  category. 

4.  Five  category  unbalanced  with  a single  minus  category. 

The  author  concluded  that  relative  to  the  balanced  scales,  the  unbalanced 
scales  induced  a shift  in  the  prestige  value  of  the  "average"  category  in 
the  direction  of  the  single-nondiscriminating  category.  In  other  words, 
significant  differences  between  scales  occurred.  Unfortunately,  this  in- 
vestigation did  not  report  data  relating  to  the  comparative  reliability  or 
validity  of  balanced  and  unbalanced  scales. 

Based  upon  a single  study,  obviously  no  conclusions  can  be  drawn  re- 
garding the  use  of  balanced  or  unbalanced  scales.  Intuitively,  the  use 
of  balanced  scales  seems  to  be  warranted  to  avoid  biasing  results  with  the 
presence of more favorable (or unfavorable) response categories.


ORDER OF PERCEIVED FAVORABLENESS OF COMMONLY
USED WORDS AND PHRASES


This  chapter  is  concerned  with  the  words  or  phrases  which  are 
commonly  used  as  response  alternatives  in  questionnaire  items.  It  is 
often  necessary  to  arrange  the  words  or  phrases  along  some  continuum  or 
in  some  order  of  degree,  and  several  studies  have  been  conducted  to  estab- 
lish this  order.  These  studies  will  be  discussed  in  this  chapter.  Studies 
concerned  with  determining  the  perceived  favorableness  of  words  and  phrases 
are  described  below  in  terms  of  the  instruments  used,  type  and  number  of 
subjects,  and  method  of  determining  the  scale  values.  When  available,  lists 
of  words  or  phrases  and  scale  values  have  been  included. 


Major  Studies  and  Lists  of  Adjectives  and  Scale  Values 


One of the first studies on the perceived favorableness of adjectives was
conducted by Mosier (1940, 1941a). In this study 296 adjectives were rated
on an 11 point scale anchored at 1 with "most unfavorable," at 6 with
"neither favorable nor unfavorable," and at 11 with "most favorable." Zero
was used if the adjective could not be rated. Each adjective or adjective
phrase was judged by approximately 140 students from introductory or second
year psychology courses. Twenty-six of the 296 words were scaled by
Thurstone's method of successive intervals, using the stimulus "completely
unsatisfactory" as the standard, with its mean at zero and its standard
deviation equal to one. The medians, scale values, and standard deviations
for these 26 words are given in Table VII-1 (Mosier, 1940). The method of
equal appearing intervals was also used to find the scale value for each of
the 296 words. A sample list of 14 words is shown in Table VII-2. In this
study correlation coefficients were computed on six words (neutral, normal,
excellent, desirable, disgusting, and unsatisfactory) which were repeated
in the list presented to the subjects. The correlation coefficients for
these words ranged from .90 to .99. In Mosier's list there were 26 words
that could not be rated by 20 or more subjects. These words are listed in
Table VII-3. Some of the words also exhibited marked bimodality of response,
and these are shown in Table VII-4. Complete tables showing the results of
this study were privately issued by Mosier (1941b), but this list was
unavailable for review.
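
The handling of individual ratings described above can be sketched as
follows: ratings of zero ("could not be rated") are set aside, the median
of the remaining ratings on the 11 point scale is taken, and a word is
flagged when 20 or more subjects were unable to rate it. The ratings are
invented for illustration, and the sketch does not attempt the successive
intervals or equal appearing intervals scaling itself.

```python
from statistics import median

def summarize_ratings(ratings, unrated_cutoff=20):
    """Summarize one word's ratings on the 1 to 11 scale (0 = unable to rate)."""
    rated = [r for r in ratings if r != 0]
    unrated = len(ratings) - len(rated)
    return {
        "median": median(rated) if rated else None,
        "unable_to_rate": unrated,
        "flag_unrated": unrated >= unrated_cutoff,
    }

# Hypothetical ratings of one word by eight subjects, three unable to rate.
example = [0, 0, 0, 9, 10, 10, 11, 8]
print(summarize_ratings(example))
```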


Mosier's research also studied the effect of usual adverbial intensives.
A set of five words was selected and these words were repeated with each
of seven intensives. The results of the study are given in Table VII-5.
Four of the five adjectives selected are arranged across the top of the
table, each heading a column. The fifth adjective, "indifferent," behaved
atypically because of ambiguous associated context. Each row of the table


TABLE VII-1

Scale Values of Standard Set of Words
(Mosier, 1940)

Stimulus                     Md.     Scale Value   S.D.

Completely unsatisfactory    1.6     0.00          1.00
Very unsatisfactory          2.r,    0.75          0.W
Catastrophic                 2.5     0.01          0.81
Treacherous                  2.7     1.05          0.62
Menacing                     2.»     1.14          0.96
Discouraging                 3.5     1.42          0.49
Painful                      2.«     1.43          0.54
Unprofitable                 4.3     1.72          0.62
Rejected                     4.6     1.71          0.54
Disputable                   5.7     2.42          9.69
Normal                       C.7     2.47          1.46
Satiating                    6.2     2.7#          1.54
Reconcilable                 6.3     3.30          9.75
Blameless                    7.6     3.64          0.90
Solacing                     S.O     3.7#          0.51
Ordinary                     6.5     3.63          1.43
Bonny                        2.4     3.97          0.61
Decent                       t.S     4.06          0.61
Preferable                   9.0     4.30          0.S5
Profitable                   9.4     4.40          0.47
Popular                      9.7     4.55          0.49
Successful                   10.0    4.65          9.54
Sublime                      10.3    4.00          OMi
Superior                     10.4    4.91          6m\
Completely agreeable         10.1    4.95          111

TABLE VII-2

Scale Values of Selected Words
(Mosier, 1941a)

Stimulus                     Scale Value

Completely unsatisfactory    0.00
Repulsive                    0.50
Disgraceful                  1.00
Wrong                        1J»
Unnecessary                  2.00
Disputable                   2.S4
Excusable                    an
Average                      1.04
Pardonable                   S.4I
Comfortable                  4.0*
Desirable                    4.50
Highly agreeable             5.02
Divine                       S5#
Very, very desirable         5.44


TABLE VII-3

Words Marked "Unable to Rate" by 20 or More Subjects
(Mosier, 1941a)

Abhorred       Ecstatic       Ominous         Propitious
Adverse        Estimable      Peerless        Satisfying
Bonny          Expedient      Pernicious      Seductive
Calamitous     Inflaming      Persuasive      Seemly
Cloying        Iniquitous     Perverse        Solacing
Debased        Noxious        Pestilential    Superlative
Despicable     Odious


TABLE  VII-4 

Words  Exhibiting  Marked  Bimodality  of  Response 
(Mosier  1941a) 


Acceptable

Amazing

Appalling

BoraMc

Bewitching

Choice

Important

Indifferent


Completely indifferent
indifferent
Highly indifferent

Quite indifferent
Unusually indifferent
Very indifferent
Very, very indifferent

Inflaming

Indispensable


Irresistible

Normal

Peerless

SatUtMl

Seductive

Sublime

Tempting

tinlu

Unspeakable


Note. Words marked with asterisks also
appear in Table VII-1.


TABLE  VII-5 

Scale  Values  as  Affected  by  Adverbial  Modifiers 
(Mosier  1941a) 


Modifier 

Desirable

Agreeable

Poor

Unsatisfactory

4.50 

4 19 

1.60 

1.47 

4.76 

4.45 

l.SO 

too 

Very 

496 

4.12 

1.11 

0.75 

Unusually

5.21 

4.S6 

0.95 

0.75 

Completely

S.14 

4.96 

0.92 

000 

Highly

S.IS 

5.02 

S.42 

5.10 

0.95 

0.10 

Very,  very 

5.66 

5.14 

0.55 

0.25 



presents  the  scale  values  for  one  of  the  adverbial  modifiers  studied. 

Jones and Thurstone (1955), in order to determine the degree of like
or dislike denoted by an adjective or phrase, had 905 enlisted personnel
rate 51 descriptive words and phrases on a nine-point scale anchored with
"greatest dislike" at the left, "neither like nor dislike" in the center,
and "greatest like" at the right. For each item a scale value was
determined by the method of successive intervals, and a standard deviation
was computed. The 51 words and phrases are given in Table VII-6.

Myers and Warner (1968) conducted a study in which 50 commonly used
statements describing product taste or ad effectiveness were rated on a
21-point Thurstone equal-interval scale. The top category was captioned
"This is the best thing I could say about the (person, product, or ad)."
The bottom and opposite category was "This is the worst thing I could say
about the (person, product, or ad)." The judges were 25 housewives, 36
business executives, 40 graduate business administration students, and 25
undergraduate business administration students. For each statement the
mean scale value and standard deviation were computed. The 50 statements
are given in Table VII-7.

Cliff (1959) reported on a study which derived scale values for 150
evaluative words and phrases. The stimuli comprised 15 unmodified
adjectives plus all combinations of them with nine intensity adverbs. Two
hundred thirteen students in introductory psychology courses at Wayne
State University, 183 at Princeton, and 129 at Dartmouth rated the words
and phrases on an 11-point scale anchored by "most unfavorable" at the
left, "neutral" in the center, and "most favorable" at the right. The
referent of the items was "favorable or unfavorable opinions about people."
Scale values were derived by the least squares, successive interval method.
The scale values of the adverb-adjective combinations are shown in Table
VII-8. The adverb and adjective value matrices are shown in Table VII-9.
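
The way adverb and adjective values might combine can be sketched in a few lines. This is only an illustration under stated assumptions: the text above does not spell out the model, and the sketch assumes that the first ("actual") Wayne column of Table VII-9 gives an adverb multiplier (with the unmodified adjective at 1.000) and an adjective value measured from the neutral point, so that a combination is approximated by their product. The function and dictionary names are hypothetical.

```python
# Values copied from the first Wayne column of Table VII-9.
# ASSUMPTION: adverbs act as multipliers on adjective values; the report
# itself only presents the matrices and does not state this rule.
ADVERB_MULTIPLIER = {
    "unmodified": 1.000,
    "slightly": 0.555,
    "very": 1.317,
    "extremely": 1.593,
}
ADJECTIVE_VALUE = {"good": 1.078, "bad": -1.025}


def combined_value(adverb, adjective):
    """Approximate distance from the neutral point for an adverb-adjective pair."""
    return ADVERB_MULTIPLIER[adverb] * ADJECTIVE_VALUE[adjective]


for adverb in ("slightly", "unmodified", "very", "extremely"):
    good = combined_value(adverb, "good")
    bad = combined_value(adverb, "bad")
    print(f"{adverb:>10} good: {good:+.3f}   {adverb:>10} bad: {bad:+.3f}")
```

Under this assumption an intensifier pushes a favorable adjective further up and an unfavorable one further down, while "slightly" pulls both toward the neutral point.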

Altemeyer (1970) conducted two studies in which numerical values were
assigned to adverb-verb combinations. In the first study, 392 introductory
psychology students rated eight adverb-verb combinations on a seven-point
scale with values from minus three to plus three. In the second study,
194 introductory psychology students assigned numerical values to nine
adverb-verb combinations on a four-point scale ranging from zero to plus
three. Plus three was labeled either "completely agree" or "strongly agree."
The mean ratings of the verbal phrases obtained for both studies are listed
in Table VII-10.

Dodd and Gerberick (1960) presented sets of word phrases to groups
of subjects, who were to place each item on a nine-point scale. For each
group of words the median scale position was calculated. Table VII-11
shows the scale positions for 34 phrases rated by 40 subjects. Table
VII-12 shows the median scale positions for 47 intensity phrases tested
in series context. Table VII-13 shows the findings from 100 judges for
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TABUS  VII-6 


Scale  Values  and  Standard  Deviations  of  Stimulus  Items 
(Jones and Thurstone, 1955)


Item 

Scale 

Value 

SD 

Item 

Scale 

Value 

SD 

Ht  sl  of  all 

6.1S 

2.48 

Mildly  like 

.85 

.47 

lavorilc 

4,f)S 

2.11 

Fair 

.78 

.85 

Like  extremely 

4.16 

1 62 

.Acceptable 

.73 

.66 

Ijkc  intensely 

4.0S 

1.59 

Only  fair 

.71 

.64 

Kxccilcnt 

3.71 

l.Ol 

Like  slightly 

.69 

.32 

Neutral 

.02 

.18 

Wontlcrful 

3.51 

. .97 

Like  not  so  well 

-.30 

1.07 

Strongly  like 

2.90 

.69 

Like  not  so  much 

-.41 

.94 

Like  very  much 

2.91 

.60 

Dislike  slightly 

-.59 

.27 

Michty  fine 

2.88 

,67 

Mildiv  dislike 

-.74 

.35 

K';|ieciallj  goorl 

2.86 

.82 

Not  pleasing 

- .8.1 

67 

Highly  favorable 

2.81 

,60 

Don't  care  for  it 

- 1.10 

Like  very  well 

2.00 

,78 

Dislike  moderately 

-1.20 

.41 

Very  good 

2.50 

.87 

Poor 

-1..SS 

.87 

Like  quite  a bit 

2.32 

,52 

Dislike 

-1  58 

.94 

nnjov 

2.21 

.86 

Dtm’l  like 

-1.81 

.97 

Preferred 

1.98 

1.17 

Mad 

-2.02 

..80 

1.91 

.76 

1 Highly  unfavorable 

-2  16 

1.17 

Welcome 

1.77 

1.18 

1 Strongly  dislike 

-2..17 

.5,1 

Tasty 

1.76 

.92 

Dislike  very  much 

-2.49 

.64 

Pleasing 

1.58 

.05 

' Very  had 

-2.53 

M 

1 Terrible 

-3  09 

-9.S 

Like  fairly  \vell 

1 51 

.59 

} Dislike  intensely 

-3.33 

1.39 

Like 

1.35 

.77 

1 Lnath 

-3.76 

3,54 

Like  moderately 

1.12 

.61 

! Dislike  extremely 

-4.32 

1.86 

OK 

.87 

1.24 

1 

Average 

.8« 

1.08 

Despise 

-C.44 

3.62 
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Means  and  Standard  Deviations  of  Commonly  Used  Statements 
(Myers  and  Warner,  1968) 
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Obtained  Successive  Intervals  Scale  Values  of  Adverb-Adjective  Combinations 


TABLE VII-9


Adverb and Adjective Value Matrices
(Cliff, 1959)

                      Wayne             Princeton            Dartmouth
                 Actual  Expected    Actual  Expected    Actual  Expected
                 Value   Value       Value   Value       Value   Value

Adverb
(Unmodified)     1.000    .987       1.000    .993       1.000    .991
Slightly          .555   1.000        .559    .999        .538   1.003
Somewhat          .685    .997        .719   1.001        .662    .995
Rather            .846   1.015        .887   1.014        .843   1.016
Pretty            .935    .995        .961    .994        .878    .992
Quite            1.042    .994       1.109    .988       1.047    .991
Decidedly        1.216    .997       1.231    .996       1.165    .992
Unusually        1.291   1.010       1.324   1.001       1.281   1.010
Very             1.317   1.008       1.323   1.007       1.254   1.002
Extremely        1.593    .996       1.546    .997       1.446   1.006

Adjective
Evil            -1.246   2.082       -.989   1.918       -.993   1.972
Wicked          -1.158   1.952       -.951   1.848       -.997   1.910
Contemptible     -.913   1.746       -.826   1.749       -.882   1.792
Immoral         -1.177   1.936       -.931   1.878       -.954   1.910
Disgusting       -.806   1.617       -.801   1.621       -.902   1.715
Bad             -1.025   2.032       -.972   2.051       -.796   1.907
Inferior         -.813   2.008       -.923   2.077       -.861   2.037
Ordinary         -.078   2.083       -.253   2.100       -.223   2.182
Average          -.040   2.121       -.296   2.254       -.211   2.195
Nice             1.007   1.742        .984   1.842       1.011   1.739
Good             1.078   1.752       1.158   1.777       1.075   1.761
Pleasant         1.001   1.835       1.050   1.856        .974   1.860
Charming          .802   2.136        .895   2.116        .910   2.013
Admirable         .983   2.001       1.170   1.892       1.086   1.892
Lovable           .836   2.173        .912   2.108        .812   2.207

TABLE VII-10
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Numerical  Ratings  of  Adverb-Verb  Combinations 
(Altemeyer,  1970) 


Adverb 

Study  1 

Study  2 

Disagree  l 

1 Agree  | 

Agree 

M 

1 

SO 

** 

r 

SD 

M 

SO 

Slightly 

-.64 

.38 

.67 

.36 

0.62 

I .31 

Substantially 

-2.17 

.51 

2.10 

.50 

2.08 

' 49 

Moderately 

-1.35 

.42 

1.47 

•41 

1.49 

1 .38 

Somewhat 

-.93 

.47 

.94 

.41 

.91  1 

1 -42 

Quite 

-2.16 

.57 

2.37 

.49 

2.23  1 

.46 

Considerably 

-2.17 

.45 

2.21 

.42 

2.18  1 

.40 

Perhaps 

-.43 

.46 

.52 

.46 

.44 

43 

Decidedly 

-2.76 

.43 

2.77 

.41 

2.74 

.47 

Mildly 

.98 

.41 

TABLE VII-11


Scale Positions for Thirty-four Phrases
(Dodd and Gerberick, 1960)


Degree phrases,
tested out-of-context            Median

complete                         [illegible]
almost complete                  8.06
very much more                   8.02
much more                        7.67
a lot more                       7.50
a good deal more                 7.29
more                             6.35
somewhat more                    6.25
a little more                    6.00
slightly more                    5.9?
now                              5.03
AS AT PRESENT                    5.00
slightly less                    3.97
a little less                    3.96
somewhat less                    3.7?
less                             3.64
much less                        2.55
a good deal less                 2.44
a lot less                       2.36
very little                      2.08
almost none                      2.04
very much less                   1.96
none                             1.11


Temporal frequency phrases,
tested out-of-context            Median

always                           8.99
without fail                     8.89
often                            7.23
usually                          7.17
frequently                       6.92
now and then                     4.79
sometimes                        4.78
occasionally                     4.13
seldom                           2.45
rarely                           2.08
never                            1.00


TABLE VII-12

Scale Positions of 47 Intensity Phrases
(Dodd and Gerberick, 1960)


TABLE VII-13

Stability of Intensity Phrases in Diverse Contexts
(Dodd and Gerberick, 1960)


Intensity                  Number of      Scale       Mean scale
phrase           Issue     responses      position    position

Very strongly      1       295            8.96
                   2       197            8.91        8.92
                   3       161            8.91

Strongly           1       269            7.01
                   2       162            7.20        7.11
                   3       271            7.12

Moderately         1       305            4.78
                   2       189            4.77        4.82
                   3       [illegible]    4.92

Indifferent        (Insufficient data)


the phrases of Subset 1 of Table VII-12 on strength of feeling, when
presented in graded series and as applied to 31 scale statements about
three issues. The three issues, respectively, were: resistance to starting
a war; drafting of women for military service and defense work; and
amount of government control. As seen in the table, Dodd and Gerberick
found that the diversity of context does not appreciably shift the
scores of the intensity phrases.
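
The stability claim can be checked numerically from the scale positions in Table VII-13. The sketch below is an illustration, not the authors' procedure: it assumes an unweighted mean across the three issues and uses the spread (largest minus smallest scale position) as a simple index of how far context shifts a phrase. The names `SCALE_POSITIONS`, `context_spread`, and `summarize` are hypothetical.

```python
from statistics import mean

# Scale positions for issues 1-3, copied from Table VII-13.
SCALE_POSITIONS = {
    "very strongly": [8.96, 8.91, 8.91],
    "strongly": [7.01, 7.20, 7.12],
    "moderately": [4.78, 4.77, 4.92],
}


def context_spread(positions):
    """Largest minus smallest scale position across contexts."""
    return max(positions) - min(positions)


def summarize(table):
    """Map each phrase to (unweighted mean, spread), both rounded to 2 places."""
    return {
        phrase: (round(mean(values), 2), round(context_spread(values), 2))
        for phrase, values in table.items()
    }


SUMMARY = summarize(SCALE_POSITIONS)
for phrase, (average, spread) in SUMMARY.items():
    print(f"{phrase:<14} mean={average:.2f}  spread={spread:.2f}")
```

On a nine-point scale, each phrase moves by less than a quarter of a scale point across the three issues, which is the sense in which the scores do not shift appreciably.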

In a study conducted by Anderson (1968), a sample of 100 college
students rated 555 personality trait words on likableness as a personality
characteristic. The words were rated on a seven-point scale, with zero
defined as "least favorable or desirable" and 6 as "most favorable or
desirable." The words were also rated for meaningfulness by 50 subjects
on a scale that ranged from zero ("I have almost no idea of the meaning
of this word") to 4 ("I have a very clear and definite understanding of
the meaning of this word"). Table VII-14 shows the list of words in
order of likableness. The first entry for each word is its likableness
value, listed in the column headed "L". The L value is the sum of the
ratings of the 100 subjects, so the mean may be obtained by dividing by
100, that is, by inserting a decimal point before the last two digits.
The second entry of the table, in the column headed s², is the variance
of the likableness ratings.
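
The conversion from an L value to a mean rating is simple arithmetic, sketched below for the entry "mature" in Table VII-14 (L = 522, variance = .66). The function names are hypothetical. Taking the square root of the tabled variance to get a standard deviation is standard practice but is not a step described in the text.

```python
import math

N_RATERS = 100  # Anderson's likableness sample size, as stated above


def mean_likableness(l_value, n_raters=N_RATERS):
    """Convert a summed rating (the L column) to a mean on the 0-6 scale."""
    return l_value / n_raters


def sd_from_variance(variance):
    """Standard deviation implied by the tabled variance."""
    return math.sqrt(variance)


print(mean_likableness(522))              # 5.22
print(round(sd_from_variance(0.66), 2))   # 0.81
```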

In a recent study (Matthews, Wright, & Yudowitch, 1975), a list
of 141 adjective phrases showing degrees of adequacy, acceptability, and
comparison was administered to enlisted men and officers at Fort Hood.
The adjective phrases were rated on an eleven-point scale with -5
anchored with "most unfavorable", zero anchored with "neither unfavorable
nor favorable", and +5 with "most favorable". Means and standard
deviations were computed for each adjective phrase. There were about 50
usable judgments for each phrase. The results of this study are shown
in Tables VII-15, VII-16, and VII-17.
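
One possible use of such means and standard deviations is to choose response anchors that are roughly evenly spaced. The sketch below is an illustration only, not a procedure described in the text. It takes a subset of the adequacy phrases from Table VII-15, picks the phrase whose mean lies closest to each target value, and breaks ties in favor of the smaller standard deviation. The target values and the names `PHRASES` and `pick_anchors` are assumptions.

```python
# (mean, standard deviation) for a subset of phrases from Table VII-15.
PHRASES = {
    "Completely adequate": (4.490, 0.825),
    "Highly adequate": (3.843, 0.606),
    "Very adequate": (3.420, 0.851),
    "Reasonably adequate": (2.412, 0.771),
    "Rather adequate": (1.755, 0.893),
    "Slightly adequate": (1.200, 0.566),
    "Borderline": (-0.020, 0.316),
    "Slightly inadequate": (-1.380, 0.772),
    "Rather inadequate": (-2.102, 0.974),
    "Very inadequate": (-3.735, 0.777),
    "Highly inadequate": (-4.196, 0.741),
    "Completely inadequate": (-4.800, 0.529),
}


def pick_anchors(table, targets):
    """For each target value, choose the phrase with the closest mean.

    Ties on distance are broken by the smaller standard deviation,
    i.e. the phrase the judges agreed on more closely.
    """
    chosen = []
    for target in targets:
        best = min(
            table,
            key=lambda phrase: (abs(table[phrase][0] - target), table[phrase][1]),
        )
        chosen.append(best)
    return chosen


# Hypothetical targets for a five-point scale spaced two units apart.
ANCHORS = pick_anchors(PHRASES, [-4, -2, 0, 2, 4])
print(ANCHORS)
```

A fuller treatment would presumably also screen out phrases with large standard deviations before matching, since a phrase the judges disagree about makes a poor anchor whatever its mean.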

Currently, the U.S. Army Test and Evaluation Command (1973) is
carrying out a project, part of which included the scaling of 32 adjectives
and adjective phrases. Average scale scores and standard deviations were
computed for the list of adjectives. The 32 adjectives and adjective
phrases are shown in Table VII-18.

Simpson (1944) studied the commonly held meaning of 20 words
denoting frequency by having 335 high school and college students indicate
how many items out of 100 each word in a list signified. The results
are presented in Table VII-19.

A significant study conducted by Mittelstaedt (1971) compared the
results of the Jones and Thurstone (1955), Cliff (1959), and Myers and
Warner (1968) studies. The Jones and Thurstone and the Myers and Warner
studies had 13 stimuli in common. Values for the same 13 stimuli (treating
them as a "scale") were taken from each of the Myers and Warner groups
(housewives, executives, graduate students, and undergraduates).
Product-moment correlation coefficients between the Jones and Thurstone
"scale" and the values for each of the Myers and Warner groups were then
calculated. The results are shown in Table VII-20. Eleven stimuli in Cliff's


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study  were  the  same  as  11  In  the  Myers  and  Warner  study.  As  before  the 
11 items for each of Cliff's study groups were treated as a "scale"
and were compared with a "scale" constructed using the same stimuli for
each of the four Myers and Warner subject groups. The product-moment
correlation coefficients for each Cliff group with each Myers and Warner
group are presented in Table VII-21. As may be seen in the tables,
Mittelstaedt found a remarkable correspondence among the scale values
from the three studies in spite of differences in time, place, subjects,
instrumentation, instructions, referents, and context.
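
The comparison described above rests on the product-moment (Pearson) correlation between two sets of scale values for the same stimuli. A minimal sketch follows. The two lists of values are made up for illustration and are not taken from any of the studies; the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL scale values for five shared stimuli, one list per study.
nine_point_study = [-3.0, -1.5, 0.0, 1.2, 2.9]
twenty_one_point_study = [3.0, 7.5, 11.0, 14.0, 19.5]

print(round(pearson_r(nine_point_study, twenty_one_point_study), 3))
```

Because the coefficient is unchanged by any positive linear rescaling of either list, values obtained on scales of different length and origin (for example a nine-point and a 21-point scale) can be compared directly in this way.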


TABLE  VII- 14 


Ratings of Likableness, and Likableness Variances for 555 Common
Personality Traits Arranged in Order of Decreasing Likableness

(Anderson,  1968) 


WoTll 

v,^ 

Word 

L 

' since. 0 

5Li 

.30 

con.;cicntjnuS 

431 

.82 

1 

.17 

resourceful 

4SI 

.74 

j uiulcrst.u'uiing 

al't 

..52 

alert 

430 

.65 

! iov.u 

517 

.60 

gooil 

430 

.Ob 

' tralliiul 

545 

,61 

witty 

480 

,.S1 

1 trusUv.Hlliy 

55') 

.62 

rti  ir-huadcd 

479 

.69 

1 

537 

,f'2 

k.a.lly 

479 

1.06 

1 (!c|Kfuial})c 

556 

.(.6 

admirable 

478 

.78 

j n;>ca-jr.imli'd 

550 

.,56 

p.tiicm 

473 

.70 

1 LiiHislUiul 

52') 

.47 

l.'.lontcd 

478 

,.S4 

1 \^isc 

.61 

pirccptivc 

477 

.81 

1 c.’nsi<  Urate 

577 

.76 

spirited 

477 

' C‘^'^‘''nalurcU 

527 

.52 

spf'rtsm.infikc 

477 

i.n 

527 

.66 

wi  1-m.inncrcd 

477 

1.05 

J mature 

522 

.66 

co-'pcMtivc 

476 

5’? 

.60 

ctiiicai 

476 

1.15 

521 

.73 

intellectual 

476 

.91 

Ulmi 

520 

.69 

vcri.alilc 

474 

.66 

iricutllv 

519 

.72 

capable 

471 

,63 

kiud-licarlcU 

51 1 

.87 

courageous 

471 

.85 

r.appy 

5M 

.77 

constructive 

468 

.46 

c'.can 

5M 

.99 

productive 

4oS 

.81 

intcrcbtin;; 

511 

.61 

jirogrcssive 

468 

.78 

ur.icldsh 

510 

.65 

individualistic 

40/ 

l.SO 

goou-humored 

507 

.73 

observant 

467 

.81 

lumor  liiic 

507 

.55 

ingenious 

466 

.75 

hunuTCus 

505 

.86 

lively 

466 

.75 

responsible 

505 

.76 

neat 

466 

.93 

cheerful 

501 

.S3 

punctual 

466 

1.26 

truElful 

5iU 

1.07 

logicual 

465 

.76 

ivarm-hcarlcd 

501 

.62 

prompt 

465 

1.16 

brn.id -minded 

505 

.50 

accurate 

■ 461 

.93 

gcr.t’.c 

503 

i.no 

sensible 

461 

.84 

\vcil-?i>‘hcn 

501 

.78 

creative 

4o2 

1.15 

C’iucnlvd 

500 

.73 

self-reliant 

462 

.96 

re.asor.aMe 

500 

.73 

tolerant 

461 

.91 

companionable 

■199 

.88 

amusing 

45)0 

.,39 

likable 

497 

.75 

clcan-cut 

460 

1.49 

trustini; 

497 

1.20 

gencrou.s 

459 

.89 

clever 

496 

.56 

sympathetic 

4.50 

1.05 

p'c  isani 

495 

.,S6 

energetic 

457 

.81 

c Hirtcous 

49( 

.9! 

h:ph'.‘=piritcd 

457 

.7.5 

fiUick-'AiUcd 

•191 

.78 

seif-controlled 

4.^0 

.69 

t.actful 

19! 

.,34 

tcr.-lcr 

456 

1.30 

h''lriful 

4)2 

.74 

active 

4.55 

.65 

appreciative 

402 

.78 

independrne 

455 

1.32 

1 if.iagin.itive 

492 

.96 

respectable 

45,5 

1.10 

cuist.anding 

492 

i.nn 

inventive 

453 

.86 

solf-disciplincd 

■I'll 

.75 

wholesome 

453 

1.14 

brilliant 

4')0 

.96 

conjtrnial 

452 

.82 

cniliisiaMic 

439 

.72 

cordial 

452 

.96 

IcvtMicadcd 

4R9 

.68 

cx[)crienccd 

451 

.76 

poiiic 

4S9 

1.11 

attentive 

4.50 

.84 

ori;;inal 

4SS 

.75 

cultured 

450 

.80 

smart 

453 

.65 

.Tank 

450 

1.10 

lnr,;iving 

4S6 

1.03 

purposeful 

4.30 

.36 

sbarp-witted 

4.36 

1.01 

decent 

4-19 

1.00 

i\cll-rcad 

436 

.67 

diligent 

419 

.82 

ambitious 

484 

1. 14 

realist 

449 

.94 

bi  ight 

4, S3 

.67 

C«gCf 

443 

,)50 

respectful 

4.33 

I.I7 

poised 

44.3 

.78 

eflicicnt 

4S2 

.94 

competent 

447 

.82 

good-lcmiKrcd 

4S2 

1.02 

realistic 

447 

.90 

grateful 

482 

1,00 

amiable 

446 

1.02 

(Table  continued  on  next  page) 


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U-,„I  J 

1 

*-  1 

'■  1 

VVuriJ  ‘ 

1.  I 

J* 

■lit 

1..30 

soft  lie irtcil  1 

.1,87 

I 09 

V'.-  M 'U.-. 

did 

.81 

dignilied 

,180 

1.115 

cnierl.iininf; 

■113 

.03 

»>)}i)<.»S'*phjcal 

380 

1.78 

ndvcnlurous 

411 

.00 

idva'istic 

1.15 

vivacious 

410 

.91 

S')/{  -sDoken 

381) 

1,0.1 

coin;»o‘iod 

410 

.87  1 

disrinlincil 

379 

121 

rc(:i\al  ( 

■),10 

.09  1 

‘cri'ius 

379 

ronuuUic 

4.10 

1.10 

tldi  die 

375 

70 

priMicii'ut 

4.18 

.70 

Ci/nvinring 

.174 

.76 

4.1S 

I ,17 

pvj  su  * >ivc 

371 

92 

skiiiui) 

•l.iS 

.Ml 

fiJiodic/tt 

373 

1.07 

cMvrprisiny 

4,17 

.70 

()u:ck 

373 

1 1.1 

cr.’.cious 

417 

1.01 

Siijdiisiicatcd 

372 

.9.5 

:vl>lo 

4.16 

.08 

llniltv 

372 

,75 

nice 

-^.id 

1. 28 

SvntimcfUal 

3-  1 

1. 10 

n;^'’LC;i!4c  i 

414 

.9,3 

objective 

1 3,0 

1,81 

Sliil'ul  1 

•idd 

..8.3 

nonconforming 

369 

! 1 .33 

Curious 

4.12 

1.13 

rikd'.U'ous  ] 

.100 

1 2,24 

nunicrn 

4.12 

.03 

, nK'slhcmatical  i 

.307 

' 1.01 

cl’.nni'iii" 

AM) 

98 

mcdiialive 

3f*0 

1.52 

social'Ic 

420 

' .85 

fearless  1 

3f'6 

1. 12 

modest 

42S 

1.2.3 

svstenutit  ] 

300 

1.12 

dr'isi\c  ' 

427 

1.03  1 

subtle  1 

3'»5 

1,00 

liuod»!c 

427 

1..H 

n'^rmal  i 

s3o2 

1.21 

u«iv 

427  j 

.82 

darimt  i 

3o0 

1 ,03 

popular 

4?.r, 

,98 

middleciass 

.ViO 

.99 

upright 

420 

I.Ol 

I'jckv 

3.5S 

1 .10 

U:cra?v 

42,4 

1.10 

proud 

3.3S 

1.00 

pr:»c:icM 

425 

.73 

sensitive 

35S  i 

2.00 

liuht'hvnrted 

424 

.99 

moralistic 

3.47 

2.13 

, v.cH-brcd 

421 

1.13 

talkative 

3.12  1 

1..12 

' rciir.cri 

422 

1.10 

1 excited 

351  1 

..80 

i svU  confident 

421 

.81 

1 moderate 

.351  1 

.00 

cool-headed 

420 

.97 

satirical 

.1.11 

1.18 

studious 

41S 

1.00 

! prudent 

34S 

1.71 

vcnlurcsiimc 

417 

.55 

1 reservi'fl 

34S 

1.00 

discreet 

410 

1.20 

persistent 

.147 

1.00 

informal 

410 

1,00 

meticulous 

340 

1.38 

th'iT-ough 

410 

.94 

unconventional 

340 

.92 

cxuilcrant 

414 

.97 

deliberate 

315 

1.40 

in(;uisilivc 

413 

1.17 

i painstaking 

345 

1.44 

ec.'rvtrolng 

412 

1.30 

I'old 

3.10 

1 22 

cut:'oii:g 

1 412 

l.iO 

suave 

3.?5  ! 

I.  .10 

'i.t-suiFicicn! 

412 

1..30 

caul i 'US 

3.11 

.77 

411 

111 

innocent 

332 

1.27 

con<l:lent 

411 

1.01 

inoffensive 

332 

.91 

moral 

411 

1.67 

shrewd 

328 

2.47 

tcjj-nssurcd 

411 

.72 

r.icthociic.al 

325 

1.54 

lintirinj 

410 

nonchalant 

324 

1.2.1 

!:0{>rfiil 

1 400 

.02 

sclf-coutcnted 

321 

2.04 

calm 

400 

.81 

pcrfcclionistic 

322 

1.09 

strong-minded 

101 

1.27 

for\Mird 

,118 

1.12 

[visitive 

•103 

1.28 

cxcitsiblc 

317 

1.15 

conf/ornt 

•lOI 

1.04 

out«jKkcn 

313 

1.77 

a: tistic 

400 

1.58 

prideful 

313 

1.99 

f’rccisc 

400 

1.0.3 

rplirt 

311 

,91 

scicnlific 

•100 

1.05 

impulsive 

307 

1.58 

orderly 

.309 

.84 

ai,'"rrS5ivc 

30 1 

I, -IS 

Social 

30.S 

1,05 

chani;cal)le 

207 

1,08 

direct 

.300 

1.07 

cnnscrvjlivc 

205  ' 

.92 

cr.rrfu) 

3yo 

,84 

shv 

291  1 

.89 

cindid 

389 

1.43 

j licsilarit 

290 

.76 

comical 

3.89 

1,09 

unpredictable 

290 

i.:o 

oiiliKinj; 

389 

1.53 

solemn 

2.S9 

.85 

Self-critical 

389 

1.55 

blunt 

287 

1.03 

fashionable 

387 

1,28 

self-righteuus  | 

2,87 

2.46 

rclipotts 

387 

1.93 

average 

284  j 

.90 

(Table  continued  on  next  page) 




TABLE  VII- 14  (cont.) 
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1 


Word 

L 

t* 

Wtjr.J 

L 

J* 

2(t.^ 

3.48 

[ Spendthrift 

221 

.73 

2So 

1,23 

! icmt'cramcntal 

221 

1.10 

unluckv 

2S0 

.52 

giillililc 

219 

.88 

lu-hflil 

279 

.65 

indecisive 

219 

1 .90 

sclf-conccrncH 

270 

1.01 

sillv 

2.9 

1.53 

aull;orit:Uivc 

274 

l.Sl 

sul)missivc 

219 

.90 

274 

] 06 

iipv.ludious 

218 

1 .06 

rc-Nllc^s 

274 

.76 

preoccui)icd 

216 

1.12 

ch»'>o<;v 

272 

1.62 

louse 

215 

.90 

5<  lf-’K»‘?9CSS5C(l  1 

1 '7 ) 

2..53 

fearful 

214 

.69 

naive 

270 

l.Oo 

unronLanlic 

214 

1.33 

<*P|>i»rtunisl 

270 

2.47 

nl'scnt-mindccl 

213 

1,00 

tluMlric.il 

260 

1..50 

ijuprnclical 

213 

1.12 

unsophi>licatcd 

2('i7 

1.23 

wiilidrawn 

213 

.80 

iim>rcs.?ionablc 

2t)() 

.01 

unadventurous 

212 

.93 

oruinary 

206  1 

.n 

sarcastic 

210 

1.30 

strict 

206 

1.30 

sad 

209 

.93 

skcnlicrxl 

264 

1..52 

unemotional 

209 

1.50 

cx(iava::ant 

204 

..88 

u'orr\'ing 

209 

.71 

forceful 

203 

1.65 

lii;;h-strung 

208 

1.57 

cunnii^c 

202 

2.KS 

unoriginal 

207 

.81 

inexi)cricnccd 

202 

.66 

unpoi.scd 

206 

.76 

unincll»o(Jical 

202 

.86 

compulsive 

205 

1.20 

daredevil 

201 

1.23 

worrier 

205 

1.00 

wordv 

261 

1.05 

demanding 

203 

.94 

da\ dreamer 

260 

.05 

utihappy 

203 

.98 

conventional 

260 

.05 

indidcrent 

202 

1 31 

materialistic 

260 

1.66 

uncultured 

201 

1.00 

self-satisfied 

260 

2.00 

clumsy 

199 

,92 

rcbcilious 

258 

1.40 

insecure 

19S 

eccentric 

25- 

I 58 

uncetcrfaffiing 

' 19.5 

.65 

opinionated 

257 

l.OS 

imiiativc 

19.S 

1,17 

stern 

257 

1.10 

melancholy  ' i 

198 

1,13 

If»nc!v 

256 

1.02 

mediocre 

197 

l.IO 

dependent 

254 

1.97 

o!)stinatc  i 

197 

.94 

unsv.y.cnmtic 

253 

.92 

unhealthy 

107 

1.42 

self-conscious 

240 

.92 

hca'istrong 

106 

1.17 

undecided 

240 

.86 

nervous 

105 

.83 

resigned 

24S 

1.22 

nonconfident 

106 

.87 

clo.vnish 

247 

1.73 

stubborn 

196 

1,3! 

anxious 

246 

.90 

unimaginative 

193 

1.06 

conforming 

216 

1.26 

down-henrted 

194 

.97 

critical 

243 

1.46 

unol'servant 

191 

.90 

conformist 

241 

1.15 

inconsistent 

103 

.91 

radi'^al 

241 

l.M) 

unj'unctual 

192 

.96 

(lUsatisfic'l 

230 

1 .65 

unipflustrious 

191 

.81 

oki-hsbioned 

239 

1.39 

tii'turbcd 

ISO 

.97 

mcci^ 

23S 

1.37 

fupcrstilious 

189 

1.33 

frivolous  I 

217 

1 55 

frustrated 

188 

.93 

discontent'‘d 

237 

1.00 

illogical  1 

186 

.97 

troubled 

235 

.71 

rash 

185 

..59 

irrcli'-dous 

23  4 

171 

unenthusiastic 

ISO 

1.0.5 

overcautious 

220 

.55 

inaccurate 

185 

.59 

silent 

22S 

.S3 

noninquisitive 

184 

.90 

tou'^h 

228 

1.74 

unagreeable 

184 

1.08 

ungraceful 

22S 

.87 

jumpv 

183 

.73 

argumentative 

227 

1.25 

jjossc-ssivc 

183 

1.02 

wiili'irav.-ins 

227 

.78 

purposeless 

183 

1.90 

uninquisitive 

225 

.94 

mtjod  V 

182 

1.30 

forcitfiil 

221 

.85 

unenlcrjirising 

1,80 

.81 

inhibited 

224 

.87 

uniuLclIectual 

IN) 

117 

unskilled 

224 

,71 

unwise 

180 

.79 

crafty 

223 

1.98 

- oversensitive 

170 

.77 

passive 

223 

.97 

incflicicnt 

178 

.08 

immodest  i 

222 

I 61 

recklc.ss 

178 

1.42 

unpopidar 

222 

.80  i 

pompc'is 

177 

1.43 

timid  ' 

077  1 

.78  1 

urcongonial 

175 

.59 

(Table  continued  on  next  page) 




TABLE  VII-14  (cont.) 
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WlTv. 

i '• 

[“7 

I Word  i L 

J* 

inti 

17.S 

.02 

1 tirc'onin 

MO 

1 .70 

un;icc.'m''d.'iting 

I7.J 

.(iS 

dis<>hrtiicnt 

123 

1 1.23 

ir.7 

.s.s 

C'lnpiiining 

127 

.74 

177 

'57 

llOk'ss 

127 

68 

Cvr.ic.ii 

1 171 

1.26 

vain 

127 

.99 

ifv’rv 

in 

.00 

h7.y 

126 

.8.8 

lijtic 

1 i('» 

.72 

unnpjircciatlve 

1 >6 

1 ■•''1 

i 1 

.61 

m;'.l:iilju5lcd 

123 

1 1.07 

ur,i'r.vl!i;;cnt 

1 m 

1 1 07 

aimless 

122 

1 1.16 

tloininccrin^j 

t()7 

l..)2 

boi*itful 

122 

1 .74 

scold' nr 

loo 

.67 

dull 

121 

1 

depressed 

Kjf, 

1.01 

gossipy 

119 

.06 

unoMiKinr 

Ki.S 

.SO 

unipncaling 

110 

l.Ot 

pcrsimistic 

161 

1.06 

li>*T)ochondriac 

ns 

.8.8 

umtlctUivc 

IM 

.74 

irrit.'iting 

118 

1 ■<' 

^oi*itcrous 

161 

1.10 

petty 

1 118 

1 

suspicious 

16.) 

.SS 

shallow 

1 118 

1 1.00 

iii.v.icnfivc 

1 162 

1.13 

dca'[)tivc 

! 117 

1 1. 01 

ON  crcontidcnL 

1 162 

grouchy 

.61 

s:mo' 

1 161 

.68 

cTolis!  ical 

1 116 

1 1.25 

I’.ns'  'ci.i^'lc 

I 161 

1.13 

iVK'ddlc'omc 

116 

1 .62 

mifoduclivc 

1 160 

.0.3 

uncivil 

116 

.06 

w .'.s;  vful 

l«l 

.67 

cold 

113 

.9  * 

i.=;6 

1.13 

unsportsmanlike 

113 

.72 

IK  tllTlfu! 

.50 

hof.sv 

112 

.,89 

i-Urnpcrcd 

I.V) 

ir)pU'.isi'ig 

112 

.71 

hot-hfidcd 

I.IS 

1. 00 

cowardly 

no 

..82 

1 uns 'ciil 

l.'S 

1.16 

discourteous 

no 

.80 

I envious 

1 07 

.77 

incompetent 

no 

.68 

. cNwcrilic.il 

i l.i7 

.S) 

childi'^h 

109 

.,81 

p«'lu!nin^ 

1 116 

1 ..30 

superlicial 

1U9 

.95 

1 150 

1 ..3S 

urrjr  itcful 

1 100 

.71 

1 

1,02 

.SL'lf-rniiu  ilcd 

1 I'l.s 

! 1.14 

foo’.lu'irdy 

i i.vt 

1.00 

i hir<Micarled  | 

K>7 

1.00 

injrr.iturc 

161 

1 X? 

unf-.ir 

107 

1.60 

' don;i''.;rlin<; 

I'i 

i 1.2S 

irres’ionsiblc 

106  I 

1.17 

shenvv  I 

i 111 

1 02 

prejudiced 

106 

1.33 

■ SloMpV 

I ' 

.0.6 

Or.ng.ging 

104 

.72 

' rnsympitlu’f ic  ^ 

1 1 ' ' 

1.32 

jealous 

lot 

. / 1 

uncojiiprotnising  ! 

! l.'.l 

1 1.26 

unpieisant 

104 

.81 

, liot-icnpcrcd 

' I.'^i 

1 .0(5 

' unicliibic 

104 

.93 

1 nnjrotic 

I, '2 

1.31 

im;iolitc 

103 

.72 

1 u-fp'irtin? 

1.i2 

.SO 

crt:dc 

102 

1.29 

1 finickv 

150 

, nosey 

102 

.07 

1 rc=c"tful 

150 

.90 

humorless 

lUl 

.82 

j unru’v 

150 

RS 

1 quarrelsome 

101 

.11 

1 f.-iult-firniins 

MS 

.06 

abusiyc 

100 

.83 

;nc5.iv 

117 

.7,3 

1 tlistui.ilfu! 

1 9') 

1.24 

1 nis'.l 

M7 

1.2.S 

intolerant 

98 

.97 

M6 

.78 

; unforgiving 

93 

.71 

scor.  ful 

14.) 

boi-ing 

97 

.76 

anllsocial 

144 

1.21 

unethical 

97 

.90 

imt  i''lc 

14.1 

.85 

urrcasnnahlc 

97 

.86 

slirr’.' 

M.5 

.60 

sclf-rcntcrcd 

96 

1.13 

t.lCtl.'SS 

142 

.85 

snobbish 

96 

.87 

c.ircic'j 

MO 

.91 

utrlsindlv 

90 

.64 

foolish 

140 

.S3 

ill  mannered 

95 

.76 

tr<’u!'!'  some 

140 

.73 

ill-tcmpcrctl 

95 

.62 

un.’raci  HIS 

140 

.71 

unfriendly 

92 

nc'u'i^^'-nl 

UO 

.6.3 

liostilc 

91 

.77 

wisliN'-wnshv 

I.M 

1.17 

di'^Hkablc 

PO 

.78 

)ir')fanc 

M7 

1 .65 

uilru-aitical 

00 

.98 

;;loanu' 

M6 

84 

oiTcnsivc 

ss 

.83 

hcl|)lc'!> 

M6 

1.12 

belligerent 

86 

.79 

disif;rccahlc 

134 

.6)7 

undcrh.indcd 

*6 

1.19 

touchy 

134 

.83 

annoying 

84 

.66 

irrationuK 

130 

,70 

disrespectful 

83 

.79 
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Word 

L 

]* 

Word 

L 

** 

83 

.87 

unkind 

66 

.71 

82 

.65 

untrustworthy 

65 

.63 

80 

.58 

deceitful 

62 

.96 

1.10 

dishonorable 

52 

.47 

.92 

malicious 

52 

.49 

78 

.88 

obnoxious 

48 

.60 

77 

.76 

untruthful 

43 

.43 

76 

.79 

dishonest 

41 

.51 

concfilcd 

74 

.84 

cruel 

40 

.54 

i;rccdv 

72 

.61 

mean 

37 

.48 

spiteful 

72 

.61 

phony 

27 

.30 

insulting 

69 

.86 

liar 

26 

.36 

insincere 

66 

.65 
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TABLE  VII- 15 


Means  and  Standard  Deviations  for  Phrases  of 
Degrees  of  Adequacy 

(Matthews,  Wright,  and  Yudowitch,  1975) 


Phrase                          Mean       SD

Totally adequate                4.6?0      .846
Absolutely adequate             4.540      .921
Completely adequate             4.490      .825
Extremely adequate              4.412      .719
Exceptionally adequate          4.380      .869
Entirely adequate               4.340      .863
Wholly adequate                 4.314     1.038
Fully adequate                  4.294      .914
Very very adequate              4.063      .876
Perfectly adequate              3.922     1.026
Highly adequate                 3.843      .606
Most adequate                   3.843      .978
Very adequate                   3.420      .851
Decidedly adequate              3.140     1.536
Considerably adequate           3.020      .874
Quite adequate                  2.980      .979
Largely adequate                2.863      .991
Substantially adequate          2.608     1.030
Reasonably adequate             2.412      .771
Pretty adequate                 2.306      .862
Rather adequate                 1.755      .893
Mildly adequate                 1.571      .670
Somewhat adequate               1.327      .793
Slightly adequate               1.200      .566
Barely adequate                  .627      .928
Neutral                          .000      .000
Borderline                      -.020      .316
Barely inadequate              -1.157      .638
Mildly inadequate              -1.353      .621
Slightly inadequate            -1.380      .772
Somewhat inadequate            -1.882      .732
Rather inadequate              -2.102      .974
Moderately inadequate          -2.157     1.017
Fairly inadequate              -2.216      .800
Pretty inadequate              -2.347      .959
Considerably inadequate        -3.600      .680
Very inadequate                -3.735      .777
Decidedly inadequate           -3.780      .944
Most inadequate                -3.980     1.545
Highly inadequate              -4.196      .741

(Table  continued  on 

next  page) 
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TABLE  VII -15  (Cont.) 


Means  and  Standard  Deviations  for  Phrases  of 
Degrees  of  Adequacy 

(Matthews,  Wright,  and  Yudowitch,  1975) 


Phrase                          Mean       SD

Very very inadequate           -4.460      .537
Exceptionally inadequate*      -4.560      .637
Extremely inadequate           -4.608      .527
Fully inadequate               -4.667      .676
Exceptionally inadequate       -4.680      .508
Wholly inadequate              -4.784      .498
Entirely inadequate            -4.792      .644
Completely inadequate          -4.800      .529
Absolutely inadequate          -4.880      .431
Totally inadequate             -4.900      .412

Note. * Indicates duplicated phrase.


TABLE VII-16


Means and Standard Deviations for Phrases of
Degrees of Acceptability
(Matthews, Wright, and Yudowitch, 1975)


Phrase                             Mean        SD

Wholly acceptable                 4.725      .563
Completely acceptable             4.686      .610
Fully acceptable                  4.412      .867
Extremely acceptable              4.392      .716
Most acceptable                   4.157      .915
Very very acceptable              4.157      .825
Highly acceptable                 4.070      .631
Quite acceptable                  3.216      .956
Largely acceptable                3.137      .991
Acceptable                        2.392     1.456
Reasonably acceptable             2.294      .722
Moderately acceptable             2.280      .722
Pretty acceptable                 2.000     1.125
Rather acceptable                 1.939      .818
Fairly acceptable                 1.840      .724
Mildly acceptable*                1.804      .950
Mildly acceptable*                1.686      .700
Somewhat acceptable               1.458     1.241
Barely acceptable                 1.078      .518
Slightly acceptable               1.039      .522
Sort of acceptable                 .940      .645
Borderline                         .000      .200
Neutral                            .000      .000
Marginal                          -.120      .515
Barely unacceptable              -1.100      .300
Slightly unacceptable            -1.255      .589
Somewhat unacceptable            -1.765      .674
Rather unacceptable              -2.020      .836
Fairly unacceptable              -2.160      .880
Moderately unacceptable          -2.340      .681
Pretty unacceptable              -2.412      .662
Reasonably unacceptable          -2.440      .753
Unacceptable                     -2.667     1.381
Substantially unacceptable       -3.235      .899
Quite unacceptable               -3.388     1.066
Largely unacceptable             -3.392      .818
Considerably unacceptable        -3.440      .779
Notably unacceptable             -3.500     1.044
Decidedly unacceptable           -3.83/     1.017
Highly unacceptable*             -4.220      .576
Highly unacceptable*             -4.294      .535
Most unacceptable                -4.420      .724
Very very unacceptable           -4.490      .500
Exceptionally unacceptable       -4.540      .607
Extremely unacceptable           -4.686      .464
Completely unacceptable          -4.900      .361
Entirely unacceptable            -4.900      .361
Wholly unacceptable              -4.922      .269
Absolutely unacceptable          -4.922      .334
Totally unacceptable             -4.941      .235

Note. * Indicates duplicated phrases.


TABLE VII-17


Means and Standard Deviations for Phrases
of Comparison
(Matthews, Wright, and Yudowitch, 1975)


Phrase                             Mean        SD

Best of all                       4.896      .510
Absolutely best                   4.843      .459
Truly best                        4.600      .721
Undoubtedly best                  4.369      .823
Decidedly best                    4.373      .839
Best                              4.216     1.459
Absolutely better                 4.060      .988
Extremely better                  3.922      .882
Substantially best                3.700      .922
Decidedly better                  3.412      .933
Conspicuously better              3.059      .802
Moderately better                 2.255      .737
Somewhat better                   1.843      .801
Rather better                     1.816      .719
Slightly better                   1.15       .776
Barely better                      .961      .656
Absolutely alike                   .588     1.623
Alike                              .216      .847
The same                           .157      .801
Neutral                            .000      .000
Borderline                        -.061      .314
Marginal                          -.184      .919
Barely worse                     -1.039      .816
Slightly worse                   -1.216      .498
Somewhat worse                   -2.078      .860
Moderately worse                 -2.220      .944
Noticeably worse                 -2.529     1.036
Worse                            -2.667     1.423
Notably worse                    -3.020     1.038
Largely worse                    -3.216     1.108
Considerably worse               -3.275     1.206
Conspicuously worse              -3.275      .887
Much worse                       -3.286      .808
Substantially worse              -3.460      .899
Decidedly worse                  -3.760      .907
Very much worse                  -3.941      .752
Absolutely worse                 -4.431      .823
Decidedly worst                  -4.431      .748
Undoubtedly worst                -4.510      .872
Absolutely worst                 -4.686     1.291
Worst of all                     -4.776     1.298


TABLE VII-18


Scale Scores of Statements
Based on Over-All Acceptability
(USA TECOM, 1973)


                                                       Standard
Statement                                  Average    Deviation

Excellent                                    6.27        0.54
Perfect in every respect                     6.22        0.86
Extremely good                               5.74        0.81
Very good                                    5.19        0.75
Unusually good                               5.03        0.98
Very good in most respects                   4.62        0.72
Above average                                4.56        0.75
Quite satisfactory                           4.35        0.95
Good                                         4.25        0.90
More than adequate                           4.13        1.11
About average                                3.77        0.85
Satisfactory                                 3.69        0.87
Moderately good                              3.58        0.77
Adequate                                     3.39        0.87
Could use some minor changes                 3.28        1.09
Not good enough for extreme conditions       3.10        1.30
Not good for rough use                       2.72        1.15
Not quite adequate                           2.40        0.85
Not very satisfactory                        2.11        0.76
Barely adequate                              2.10        0.84
Not very good                                2.10        0.85
Below average                                2.03        0.79
Unsatisfactory but usable                    2.00        0.87
Needs major changes                          1.97        1.12
Not adequate                                 1.83        0.98
Barely acceptable                            1.79        0.90
Not good enough for general use              1.76        1.21
Better than nothing                          1.22        1.08
Poor                                         1.06        1.11
Very poor                                    0.76        0.95
Very unsatisfactory                          0.69        1.32
Extremely poor                               0.36        0.76

TABLE VII-19


Meaning of Frequency Words
(Simpson, 1944)


                              75% of Students
                              Thought the Term
                              Meant Less Than This
Term                          Percentage of the Time

Always                               100
Very often                            93
Usually                               90
Often                                 85
Generally                             85
Frequently                            80
Rather often                          80
About as often as not                 50
Now and then                          35
Sometimes                             35
Occasionally                          33
Once in a while                       27
Not often                             20
Seldom                                18
Usually not                           15
Hardly ever                           13
Very seldom                           10
Rarely                                10
Almost never                           5
Never                                  2


TABLE VII-20


Correlations of
Jones and Thurstone and Myers and Warner
"Scale" Values for 13 Stimuli
(Mittelstaedt, 1971)


Myers-Warner Groups              r

Housewives                     .992
Executives                     .986
Graduates                      .989
Undergraduates                 .993

TABLE VII-21


Correlations of
Myers-Warner and Cliff Scale Values
for 11 Stimuli
(Mittelstaedt, 1971)


                                   Cliff Study Groups
Myers-Warner Groups      Wayne State    Princeton    Dartmouth

Housewives                   .990          .990         .987
Executives                   .990          .988         .989
Graduates                    .993          .994         .991
Undergraduates               .996          .99?         .995

TABLE VII-22


Summary of Studies on Perceived Favorableness
of Commonly Used Words and Phrases


                                                      No. of                           No. of
Experimenter          Type of Subjects                Subjects   Type of Words         Words

Mosier (1940,         Psychology students               140      Adjectives             289
1941a, 1941b)

Jones & Thurstone     Army enlisted personnel           905      Adverbs                  7
(1955)                                                           Adjectives              51

Myers & Warner        Housewives                         25      Adjectives              50
(1968)                Business executives                36
                      Graduate business                  40
                      administration students
                      Undergraduate business             25
                      administration students

Cliff (1959)          Undergraduate students            537      Adverbs                  9

Altemeyer (1970)      College students                  586      Adverbs                  8

Dodd & Gerberick      Unknown                            40      Adjectives              81
(1960)

Anderson (1968)       College students                  100      Personality            555
                                                                 trait words

USA TECOM (1973)      Unknown                           Unk.     Adjectives              32

Simpson (1944)        High school and                   100      Frequency terms         20
                      college students

Matthews, Wright,     Army enlisted personnel            51      Adjective phrases      141
& Yudowitch (1975)    and officers


Table VII-22 gives a summary of the studies conducted to show the
perceived favorableness of words and phrases. As can be seen in the table,
a large variety of subjects have been used in the studies. A comparison of
the tables presented in this chapter shows that common words can be found
across studies. Mosier (1941a) showed that the same word gets the same
rating if it is repeated in a list, which implies that words have an
inherent meaning. The view that words have an inherent meaning or perceived
favorableness independent of context and instrument used was supported by
Dodd and Gerberick (1960) and Mittelstaedt (1971).
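
The kind of agreement summarized in Tables VII-20 and VII-21 is a product-moment correlation between two sets of scale values for the same stimuli. The sketch below only illustrates that computation; it is not a reconstruction of Mittelstaedt's analysis. It pairs the mean ratings of twelve modifiers that appear in both Table VII-15 ("... adequate") and Table VII-16 ("... acceptable"). The choice of modifiers and the function name `pearson_r` are assumptions made for the example.

```python
from math import sqrt

# Mean ratings copied from Table VII-15 (adequacy) and Table VII-16
# (acceptability) for modifiers that occur in both tables.
ADEQUATE = {
    "Wholly": 4.314, "Completely": 4.490, "Fully": 4.294, "Extremely": 4.412,
    "Quite": 2.980, "Largely": 2.863, "Reasonably": 2.412, "Pretty": 2.306,
    "Rather": 1.755, "Somewhat": 1.327, "Slightly": 1.200, "Barely": 0.627,
}
ACCEPTABLE = {
    "Wholly": 4.725, "Completely": 4.686, "Fully": 4.412, "Extremely": 4.392,
    "Quite": 3.216, "Largely": 3.137, "Reasonably": 2.294, "Pretty": 2.000,
    "Rather": 1.939, "Somewhat": 1.458, "Slightly": 1.039, "Barely": 1.078,
}


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Compare only the modifiers present in both tables, in a fixed order.
shared = sorted(ADEQUATE.keys() & ACCEPTABLE.keys())
r = pearson_r([ADEQUATE[m] for m in shared], [ACCEPTABLE[m] for m in shared])
print(f"r = {r:.3f} over {len(shared)} shared modifiers")
```

A correlation near 1.0 here would indicate only that the modifiers are ordered similarly on the two dimensions by the same respondents. The cross-study comparisons in Tables VII-20 and VII-21 involve different subject groups.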


Chapter  VIII 


CONSIDERATIONS  RELATED  TO  THE  PHYSICAL  CHARACTERISTICS 
OF  QUESTIONNAIRES 


This chapter considers four topics related to the physical
characteristics of questionnaires: the location of the response alternatives
relative to the question stem; questionnaire length; format considerations
such as color, type size, spacing, and numbering; and the use of answer
sheets.


Location  of  Response  Alternatives  Relative  to  Stem 

Only  two  articles  were  found  that  pertained  to  the  location  of 
response  alternatives  relative  to  the  question  stem.  Blumberg,  DeSoto, 
and  Keuthe  (1966)  had  over  100  subjects  rate  well-known  names  on  a variety 
of  traits,  using  a nine  point  scale.  They  concluded  that  untrained  raters 
can  make  relatively  error-free  ratings  without  being  influenced  by  whether 
or  not  the  "good"  end  of  a graphic  rating  scale  was  at  the  left,  right, 
top,  or  bottom. 

The purpose of a study by Madden and Bourdon (1963) was to determine
whether mean job evaluation ratings would differ as a function of seven
variations in rating scale format. One of the variations included printing
responses vertically or horizontally. Sixty basic airmen rated 15
occupations on nine job requirements for each format. It was concluded that
the rating scale format was a determiner of the judgment of the raters in
the sample.


Questionnaire  Length 

This section considers the effects of overall questionnaire instrument
length on response rate, response inconsistency, and validity. Disagreement
was found on the effect of length on the response rate of mailed
questionnaires. Sletto (1940), in a 300 subject pretest of 10, 25, and
35 page mailed questionnaires, found no significant effect of length on
response rate. Champion and Sear (1969) used 3, 6, and 9 page versions
of the same number of questions spaced so as to affect apparent length.
They mailed the questionnaires to 802 subjects, and their results
contradicted Sletto's findings, since they obtained a greater response rate
with the longer questionnaire. However, the overall response rate was only
35%.

Three other investigations concluded that the response rate for
mailed questionnaires is greater for shorter questionnaires. Leslie
(1970) concluded (without reporting data, however) that one or two page
questionnaires improve the response rate for mailed questionnaires. Ford
(1968) found a slightly increased (but nonsignificant) response rate in
a 1,656 subject test of the use of a printed, folder-type questionnaire, as
compared with a larger appearing mimeographed, stapled format. One versus two
page mailed questionnaires were tested by Bauer and Meissner (1963).
They found that, in going from the one page to the two page format,
nonresponse increased from negligible to over 5%. They also found that
absolute correctness of responses dropped from 53.5% to 47%, and nonsense
answers increased from 1.5% to 5%. Their report, however, gave
insufficient information to allow the reader to check the conclusions.

The effect of instrument length (in terms of total number of items)
and other characteristics on response inconsistency was studied by Ace
and Davis (1972). Using 177 college sophomores, they found that response
inconsistency was only somewhat influenced by length and format, but
considerably influenced by the type of scoring.

There have been a number of studies on the effect of instrument
length on validity, but since they were concerned with cognitive and
achievement tests, they were outside the main scope of this review. For example,
Brokaw (1951), using six tests administered to 223 Air Force basic airmen
to class them for training in technical specialties, found composite
validity against course grades was .56 for half-length tests, .57
for full-length tests. Battery reliability of the half-length tests was
.90, compared with .95 for the full-length tests. Since the tests measured
reasoning and knowledge of facts, the results may not be generalizable
to questionnaires as defined for this review.
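
The two reliability figures are close to what the Spearman-Brown formula would predict when a test is doubled in length. The formula is a standard psychometric result; it is brought in here as an illustration and is not cited in the study as summarized above. The function name `spearman_brown` and its parameters are assumptions made for this sketch.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by length_factor.

    reliability   -- reliability of the original test (0 to 1)
    length_factor -- new length divided by original length (2.0 = doubled)
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Battery reliability reported for the half-length tests.
half_length_reliability = 0.90

# Doubling the half-length battery gives the full-length battery.
predicted_full = spearman_brown(half_length_reliability, 2.0)
print(f"predicted full-length reliability: {predicted_full:.3f}")
```

The predicted value rounds to .95, the figure reported for the full-length tests. The formula addresses reliability only; it says nothing about why validity barely changed (.56 versus .57).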

In another study, Appel (1959) compared true-false and forced choice
questionnaires, each administered to about 400 college students. He
concluded that for longer forms the forced choice method is likely to
result in greater validity, while for shorter forms the true-false method
is likely to prove superior.

In  conclusion,  disagreement  was  found  on  the  effect  of  length  on 
response  rate  to  mailed  questionnaires,  little  information  was  found  on 
the  effect  of  length  on  response  consistency,  and  nothing  was  found  relating 
length  to  the  validity  of  questionnaires  as  defined  for  this  review. 


Questionnaire  Format  Considerations 

Little specific information was found related to questionnaire format
considerations such as type size, spacing, color, etc. Sletto (1940) had
47 students rate the esthetic appearance of 10 different questionnaire
formats, and found that preferences were not highly individualistic nor
erratic. Wolfe (1956) discussed the effects of layout appearance,
arrangement of questions and responses, and instructions. He noted
differences, but provided no empirical data. Finally, Lehman (1967) reported
that varying the length of a rating scale line from three and one-half to
seven inches appeared unimportant in similarity ratings.


The Use of Answer Sheets


Several  articles  were  located  regarding  the  use  of  answer  sheets, 
although  this  topic  was  not  stressed  in  the  literature  review.  Dunlap 
(1940)  tested  serially  numbered,  repetitively  numbered,  articulated, 
and  unarticulated  answer  sheets  in  all  combinations,  using  20  groups  of 
fourth  and  eighth  graders.  The  sizes  of  the  groups  ranged  from  251  to 
364.  His  major  conclusions  were: 

1.  Marking articulated, repetitively numbered separate answer
sheets is equally as satisfactory as underlining the correct
response.

2.  There  is  evidence  that  repetitive  numbering  results  in  more 
errors  than  serial  numbering. 

3.  The  use  of  articulated,  serially  numbered  answer  sheets  is 
entirely  satisfactory  when  compared  with  the  results  in  using  the 
underlining  method. 

4.  The  use  of  unarticulated  but  serially  numbered  answer  sheets 
also  seems  justified.  There  was,  however,  a slight  difference  in 
results  favoring  articulated,  serially  numbered  answer  sheets. 

5.  Unarticulated, repetitively numbered answer sheets are somewhat
less satisfactory substitutes for the underlining type of test
than serially numbered, articulated sheets.

6.  There  is  no  evidence  that  the  separate  answer  sheet  cannot  be 
used  with  children  in  grade  levels  at  least  as  low  as  the  fourth. 

7.  There is no evidence to support the contention that in a multiple
choice test there is a psychological advantage in having the response
indicated as close in time and space as possible (i.e., by underlining)
to the decision as to the correct answer.

8.  In  summary,  other  things  being  equal,  the  use  of  articulated, 
serially  numbered  answer  sheets  is  recommended,  particularly  if  the 
test  is  short  enough  to  enable  all  answers  to  be  recorded  on  a single 
side  of  the  sheet. 

In  a similar  study,  Faerber  (1951)  tested  230  students,  finding  a 
multiple  choice  test  with  a separate  answer  sheet  more  difficult  than 
open  answer,  right/wrong,  or  multiple  choice  without  a separate  answer 
sheet  when  the  tests  were  timed.  When  the  effects  of  time  were  removed, 
the  machine  scored  forms  (all  but  the  open  answer)  were  more  difficult  than 
the  open  answer  form.  A different  set  of  abilities  for  answering  machine 
scored tests was hypothesized.

Bell, Hobb, and Hoyt (1964) compared a standard two page "fill in the
mark" machine scored answer sheet with a new condensed one page answer
sheet. For 1,048 civilian employees, the condensed sheet produced
significantly lower scores, leading the authors to attribute the difference
to the decreased type size. They concluded, however, that measures can
be taken to compensate for the change. In a related experiment with 482
subjects using cross-out instead of fill-in answering on the condensed
sheet, no significant differences were found between the one and two page
answer sheets. The authors did not examine differences in subject
familiarity with the two forms.

A comparison of answer sheets was also made by Dizney, Merrifield, and
Davis (1966). Using an arithmetic test, they found that, in response to
each of three questions, proportionally more students using the IBM 1230
format reported difficulty in using the answer sheet than did those
students using the older IBM 305 format, although the answer sheets were
similar. However, a statistical test of the scores of those reporting
difficulty using the two formats indicated no significant differences.

In order to investigate the age range over which separate answer
sheets could be used, Solomon (1971) tested 116 inner city fourth graders
with three different answer formats for a reading test: answers within
the booklet; separate hand scorable answer sheets; and separate machine
scorable answer sheets. No statistically significant differences were
found. For an older age group, Hart, Faust, Rowland, and Lucier (1964)
recommended the use of optical scan and reusable booklets with graduated
pages whenever possible. Their report, using a sample size of 2,160,
was on a study of the attitudes of troops in the tropics.

In a study of problems related to the use of answer sheets, Swordes
(1952) found that respondents frequently erred in using the last space
on a multiple choice answer form when there were more spaces than actual
choices. Precautions should, therefore, be taken, such as using the same
number of distractors.
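
Where an answer sheet does offer more spaces than an item has choices, completed sheets can at least be screened for marks in the unused spaces. The following is a minimal sketch of such a screen. It is not a procedure described by Swordes, and the function name `find_out_of_range_marks` and the sample data are hypothetical.

```python
def find_out_of_range_marks(marked_spaces, choices_per_item):
    """Return (item_number, marked_space) pairs for marks placed in a
    space beyond the number of actual choices for that item.

    marked_spaces    -- 1-based space marked for each item, or None if blank
    choices_per_item -- number of actual choices offered for each item
    """
    flagged = []
    pairs = zip(marked_spaces, choices_per_item)
    for item_number, (mark, n_choices) in enumerate(pairs, start=1):
        if mark is not None and mark > n_choices:
            flagged.append((item_number, mark))
    return flagged


# Hypothetical sheet with five spaces per item; items 1-4 have four choices.
marks = [1, 5, 3, None, 4]
choices = [4, 4, 4, 4, 5]
print(find_out_of_range_marks(marks, choices))  # item 2 used the extra space
```

A screen of this kind only detects the error after the fact; matching the number of spaces to the number of choices, as recommended above, prevents it.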

Although  the  studies  reported  above  had  to  do  with  the  use  of  answer 
sheets  with  achievement  tests,  the  results  would  appear  generalizable  to 
the  construction  of  questionnaires. 


Chapter  IX 


CONSIDERATIONS RELATED TO THE ADMINISTRATION
OF QUESTIONNAIRES


Considerations related to the administration of questionnaires are
considered in this chapter since such matters are obviously of concern
when questionnaires are constructed. The effects of instructions upon
questionnaire results are first discussed, followed by sections on the
effects of: various motivational factors; anonymity; administration time;
characteristics of questionnaire administrators; administration
conditions; and other factors such as bias and halo.


Effects of Instructions


Several studies discussed the amount of variance in responses due to
variations in giving instructions. Some of the variance in instructions
is unintentional, as was indicated in a study conducted by Belson
(undated, a). In that study 236 tape recorded interviews were conducted, in
which respondents were asked to use the semantic differential scaling system.
The interviewers were told to deliver the printed instructions word for
word. Analysis of the tape transcripts showed only 2% of the instructions
were delivered word for word. Deviations from the instructions took the
following forms: total phrases were eliminated with considerable ad libbing;
and key words intended to focus the respondent's attention on some specific
part of the instructions were frequently omitted or changed. The deliveries
were rated for accuracy in presenting the 34 basic ideas in the instructions.
In the average delivery, 28% of the key ideas were lost, mainly through
omission. Interviewer performance varied substantially both across
interviewers and within individuals.

Madow  (1965)  stated  that  the  interviewer's  attitude  toward  the  question 
communicates  itself  sufficiently  to  the  respondent  to  alter  the  meaning  of 
the  question.  He  concluded  that  the  nature  of  the  survey  and  the  survey 
organization  are  determining  factors  in  whether  or  not  the  interviewer  must 
follow  the  interviewer  schedule  verbatim  or  may  vary  the  wording. 

Instructions are often varied in experiments to induce response sets.
These experiments usually use standard instructions and instructions to
fake. In a study by Winters and Bartlett (1966), a forced choice scale
was constructed to provide independent measures of two types of response
tendency, acquiescence and social desirability. The scale was administered
under standard and faking instructions. Factor analysis yielded a social
desirability factor under each instructional set, and an acquiescence
factor only under standard instructions. Social desirability scores were
observed to be orthogonal between instructional conditions. In another
study conducted by Bartlett and Doorly (1967) using a forced choice scale
measuring social desirability, the authors found that different
instructional sets do affect the tendency to answer in a socially desirable
way.

Lederman (1971) administered two formats of the Thorndike Dimensions of
Temperament to college students under regular directions and under
instructions to give socially desirable responses. He found that the forced
choice format produced low scale intercorrelations under regular directions,
but under the desirability directions a common factor appeared. He also
found that the same factor appeared in the questionnaire format under both
regular and desirability directions. French (1958) found that significant
differences were observed in most of the scales of the Edwards Personal
Preference Schedule under different instructions.

Rambo (1968) found, using the Hinckley Scale of Attitude toward
Negroes and a Likert scale, that the magnitude of the linear association
between the scales was influenced by the instructions presented to the
subjects. Frederiksen and Messick (1958) found that instructions
altered mean criticalness set scores in the expected direction to an extent
that it was significant on one of the three tests used, nearly significant
on another, and nonsignificant on the third. And Bloxom (1968) found that
mildly anger arousing printed instructions when compared with non-anger
arousing instructions elicited more responses of negative self-rej[illegible]

Jarrett and Sherriffs (1956) concluded that telling people to answer
every item on a questionnaire or to omit an item if there is clearly no
difference does not yield different results. Miron (1961) found that urging
subjects to answer a retest of a questionnaire the same way they had
answered the original questionnaire was superior to nonrecall conditions
with respect to mean absolute test-retest deviations.

Berger and Sullivan (1970) examined an hypothesis that instructions
emphasizing a respondent's importance in an attitude survey would result
in a reduced number of "don't know" responses to the items. A 20 item
questionnaire was administered to 180 undergraduates under three contexts:
face-to-face interviews; telephone interviews; and group administration.
Contrary to the hypothesis, the telephone interviews and group
administration contexts yielded significantly more "don't knows" under the
instructions emphasizing the respondent's importance than under the control
instructions. There was no difference between instructional sets in the
face-to-face context.

From the above discussion it appears that instructions do affect the
responses collected by questionnaires. It also appears that more
systematic research is needed to determine the range of variations in
instructions that may affect the results given on questionnaires, and the
effects of variations in respondent understanding of instructions.

Effects of Various Motivational Factors


In this section, the effects of various motivational factors are
considered. The effects of a lack of respondent motivation will first be
briefly considered. Attention will then be given to factors that affect
the rate of return of questionnaires. Respondent preferences for certain
item formats will next be reviewed, followed by a discussion of the effects
of the behavior of the administrator on questionnaire response.

Effects of lack of motivation. Some of the studies dealing
with motivation discuss the effect of lack of motivation on responses to
items. F I a 1,1 j'. in (195;j) tested the effects of motivation in a study
of two groups. One group was reported as having high motivation: a group of
Air Force Aviation cadets taking the Aircrew Job Elements Aptitude Tests.
The other group was reported as having low motivation and consisted of
seniors in their last two weeks of school or students involved in five days
of testing. The incidence of patterned responses or random responses was
higher in those groups where the motivation was suspect.
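
The study as summarized here does not say how patterned responses were identified. One simple screen, offered only as an assumption and not as the method used in that study, is to flag answer sequences containing a long unbroken run of the same response. The function names and the run-length threshold below are hypothetical and would need to be set for the instrument at hand.

```python
def longest_run(responses):
    """Length of the longest run of identical consecutive responses."""
    longest = 0
    current = 0
    previous = object()  # sentinel that equals no real response
    for response in responses:
        current = current + 1 if response == previous else 1
        longest = max(longest, current)
        previous = response
    return longest


def looks_patterned(responses, max_run=5):
    """Flag a sequence whose longest run exceeds max_run (assumed cutoff)."""
    return longest_run(responses) > max_run


# Hypothetical answer sequences from two respondents.
varied = ["A", "C", "B", "D", "A", "B", "C", "A"]
straight_line = ["A", "A", "A", "A", "A", "A", "B", "C"]
print(looks_patterned(varied), looks_patterned(straight_line))
```

A screen of this kind catches only one kind of pattern (straight-line marking); alternating or random responding would need other checks, such as consistency between repeated items.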

Hart, Faust, Rowland, and Lucier (1964) found that respondents who
made three or more consistent errors in a study to assess the attitudes
of troops in the tropics were more negative in attitude. Levy-Leboyer
(1955) similarly found that omissions on achievement and intelligence tests
were more due to motivation at the moment than to any persisting
psychological trait. Finally Kendall (1954), in a study of factors which
seemed to contribute to unstable responses people make to attitude
questionnaires, found that shifts in the mood of the respondents and the
degree of interest in or concern with the questions posed affected results.

Factors affecting the rate of return of questionnaires. Some of the
factors that affect the rate of return of questionnaires are reviewed below.
Included are: the effect of ego involving the subject in the study; the
use of advance letters, cover letters, and other techniques to stimulate
the return of mailed questionnaires; and other factors.

Three studies examined the effect of ego involving the subject in the
study. Slocum, Empey, and Swanson (1956) found that efforts to establish
an image of the social utility of a survey and to emphasize the special
role of each respondent maximized the responses to a questionnaire and
structured interview. Sletto (1940) found that three different cover
letters did not significantly affect the rate of return of mailed
questionnaires. One cover letter requested help on an altruistic basis,
another on a challenge basis, and a third on a "help us" basis. In contrast
to the Sletto (1940) study, Champion and Sear (1969) found that egoistic
cover letters produced greater response rates than altruistic. This was
found true especially in the case of lower class respondents. Cahalan
(1951) conducted a study in which a questionnaire dealing with Army interests
was mailed to 1,051 Army officers. The author felt the 84% return rate
was due to the interaction of the institutional control applied by the
Army, or the traditional responsibility of Army officers, with the wording
of the cover letter which stressed responsibility and requested a return
within five days.

In examinations of the effect of letters sent in advance of the
questionnaire on response rates, Ford (1967) found that such letters
significantly improved response rates. Myers and Hang (1967) also found that
an initial letter increased response rates significantly, but Brunner and
Carroll (1969) found that the initial letter did not significantly reduce
interview refusal rate.


Glickman  (1962)  concluded  that  repeated  administration  did  not  have 
an  adverse  effect  on  the  proportion  of  subjects  returning  questionnaires. 
Durant and Maas (1956) also found that people previously approached
responded more readily a second time.

Several  studies  have  been  conducted  to  determine  what  techniques  are 
useful  to  stimulate  the  return  of  mailed  questionnaires.  One  technique 
involved including a stamped and addressed return envelope with the
questionnaire. Ferris (1951) determined that including the stamped and addressed
return  envelope  increased  the  response  rate  53%.  Clausen  and  Ford  (1947), 
however,  concluded  that  including  prepaid  return  envelopes  did  not  influence 
the  return  of  questionnaires. 

Another  technique  that  was  studied  was  the  followup  with  reminders  to 
complete  questionnaires.  Myers  and  Hang  (1967)  got  a response  rate  of  28% 
for  a group  that  had  a followup  letter  mailed  one  week  after  the  mailing 
of a questionnaire, and a 28.9% response rate from a group treated similarly
but with no followup. Clausen and Ford (1947) also concluded that
followup had no effect on response rate. Ferris (1951) similarly concluded
that prodding with reminder postcards does not increase response rate. But
Watson (1965) did find that followup postcards increased response rates
from 30% to 40%, and a two day followup mailing to the entire sample raised
the response rate from 30% to 46%. Leslie (1970) recommended, without data
to  support  it,  the  use  of  second,  third,  and  fourth  mailings  and  followup 
with  personal  calls  to  improve  response  rates. 
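
The followup comparisons above are differences between two response rates. As a minimal sketch of how such a difference could be checked for significance, the following uses a pooled two-proportion z-test. The function name and the sample sizes (200 questionnaires mailed per group) are hypothetical and chosen only for illustration; the studies cited here do not report group sizes in this review, and this is not necessarily the test their authors used.

```python
import math


def two_proportion_z(returned_a, mailed_a, returned_b, mailed_b):
    """Pooled two-proportion z-test for a difference in response rates.

    Returns the z statistic and the two-sided p-value under the
    normal approximation.
    """
    rate_a = returned_a / mailed_a
    rate_b = returned_b / mailed_b
    # Pooled return rate under the null hypothesis of no difference.
    pooled = (returned_a + returned_b) / (mailed_a + mailed_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / mailed_a + 1 / mailed_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical group sizes: 40% versus 30% returns from 200 mailings each.
z_stat, p_val = two_proportion_z(80, 200, 60, 200)
print(f"z = {z_stat:.2f}, two-sided p = {p_val:.3f}")
```

With much smaller groups the same ten-point difference would not reach conventional significance, which is one reason reported response-rate effects are hard to compare across studies without their sample sizes.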

A third  technique  which  was  studied  involved  the  type  of  postage  used 
to  send  the  questionnaire.  Watson  (1965)  obtained  the  same  response  rate 
using air mail and third class mailing. Clausen and Ford (1947) concluded
that  special  delivery  did  not  influence  response  rate.  Champion  and  Sear 
(1969)  came  to  the  same  conclusion. 

The  effect  of  incentives  on  return  rates  of  mailed  questionnaires  has 
been  studied  by  Brennan  (1958).  He  tested  the  hypothesis  that  trading 
stamps  would  be  an  effective  incentive  to  improve  the  response  return 
rates in mail surveys. A questionnaire was sent out without incentives,
with 50 trading stamps included, or with the promise of 100 trading stamps
or 25¢ upon return of the completed questionnaire. The results showed no
significant differences: the average return rate under all three conditions
was approximately 27%. Watson (1965), however, received a return rate of
40% for a 10¢ incentive, of 41% when a packet of stamps was used, and of
48% when a 25¢ incentive was used.

Ferris  (1951)  found  that  responses  were  most  frequently  mailed  on 
Thursday and Friday, least frequently on Saturday and Sunday. In turn,
Leslie (1970) suggested mailing questionnaires so that they arrive on Thursday
or Friday.

Simon  (1967)  investigated  the  effect  of  personally  typed  cover  letters 
versus mimeographed form letters on response rate in mail surveys. He
found that there was no overall clearcut advantage for personally typed
cover letters in terms of response rate. He also noted the possibility
that  in  some  cases  personally  typed  cover  letters  may  reflect  lack  of 
anonymity  and  may  therefore  decrease  response. 

Ford  (1968)  demonstrated  that  printed,  folder-type  questionnaires 
generated higher responses than mimeographed, stapled questionnaires.
This supports Leslie (1970) in the suggestion that long questionnaires be
printed,  not  mimeographed,  because  printing  reduces  the  length  by  two- 
thirds.  However,  Durant  and  Maas  (1956)  found  that  having  to  fill  in  two  ques- 
tions did  not  greatly  increase  cooperative  response  over  having  to  fill  in  a 
53  item  questionnaire,  which  suggests  that  length  may  not  be  related  to 
motivation  for  returning  a questionnaire. 


Respondent  preferences  for  certain  item  formats.  Several  studies 
demonstrated  subjects'  preferences  for  specific  types  of  item  formats.  In 
some  cases  this  preference  did  not  seem  to  have  an  effect  on  the  results. 
Steinbeck  (1972)  found  that  a format  in  which  subjects  rated  themselves  on 
given  items  on  a nine  point  scale  was  more  acceptable  than  a format  in 
which  they  had  to  select  items  most  or  least  like  themselves.  Zavala  (1965) 
found  that  raters  prefer  forced  choice  formats  using  four  favorable  items 
from  which  they  choose  the  items  most  characteristic  of  the  person  rated. 
Waters  and  Wherry  (1961a)  determined  that  subjects  were  more  favorable 
toward  a response  format  allowing  them  to  indicate  the  degree  of  applica- 
bility of  each  statement  in  the  forced  choice  pairs  than  they  were  towards 
other  forced  choice  formats.  Waters  (1966)  also  reported  that  a subject's 
reaction  to  a forced  choice  scale  is  more  favorable  when  some  method  is 
incorporated  whereby  he  is  given  the  opportunity  to  indicate  the  degree  of 
applicability  of  each  item  to  himself.  Gaito  (1962)  speculated  that  forced 
sort  Q-sorting  techniques  may  adversely  influence  a subject's  spontaneity. 
Turgut (1963) showed that 57% of a group's subjects preferred the format
of the Edwards Personal Preference Schedule, and 32% liked the Q-sort format.
Jones  (1968)  showed  that  subjects  clearly  preferred  multiple  category 
options  over  two  category  options.  Subjects  also  reported  that  multiple 
choice and true-false continuums were more interesting than the dichotomous
true-false format. Hughes (1967) reported that a check list was preferred
over the semantic differential when both were first administered to
subjects, but preference for the semantic differential increased from 11%
to 34% in a retest situation while the preference for the check list declined
from 57% to 40%. Hughes attributed the increased preference for the
semantic differential to the respondents becoming more familiar with it.
Matell (1970) suggested that, in constructing Likert-type scales, the
number of steps should be chosen according to the respondent's preference.


Effects of the behavior of the administrator on responses. Several
studies concluded that reinforcing behaviors of the interviewer, test
administrator, or data collector have an influence on the responses obtained.
The effect of the experimenter's influence was examined in
the responses of first graders from middle and working class backgrounds
under the conditions of reinforcing, neutral, and non-supporting atmospheres
induced by the experimenter (Sgan, 1967). Results indicated that
middle  class  children  were  more  susceptible  to  the  experimenter's  influ- 
ence and  that,  under  a reinforcing  atmosphere,  they  were  significantly 
more  apt  to  change  their  preferences.  Stember  and  Hyman  (1949)  conclud- 
ed that  interested  respondents  were  more  subject  to  interviewer  effects 
than  uninterested  ones.  Wickes'  (1956)  findings  suggested  that  such 
comments  as  "good"  or  "fine"  and  such  actions  as  smiling  and  nodding  by 
examiners  have  a decided  effect  upon  test  results. 

Marquis,  Marshall,  and  Oskamp  (undated)  reported  that,  although  re- 
spondents liked  the  interview  more  when  the  interviewer  was  supportive, 
his  manner  had  no  effect  on  the  accuracy  or  completeness  of  the  responses. 
However, Marquis, Cannell, and Laurent (1972) reported that the interviewer's
use of reinforcement increased the accuracy of reports from respondents
who had not completed high school, and had the opposite effect on
those  who  had.  Field  (1955)  found  that  praised  respondents  in  public  opin- 
ion interviewing  situations  tended  significantly  to  offer  more  answers 
than  the  unpraised  ones.  Praising  respondents  tended  to  reduce  "don't 
know"  answers,  but  praising  did  not  increase  insincere  or  dishonest  responses. 

In a study conducted by Hildum and Brown (1956), it was found that
"good" proved to bias results in a phone attitude survey while "mm-hmm" did
not. Matarazzo et al. (1964) reported a 31% increase in subjects' average duration
of single utterances when the interviewer said "mm-hmm" all the time the
subject  was  talking.  In  a cross  validation  study  there  was  an  8U%  increase 
in  the  mean  duration  of  single  units  of  interviewee  speech.  Dixon  (1970) 
used  subjects  who  were  high  or  low  on  a social  desirability  scale  in  an 
experiment using reinforcement to increase self-referent statements. The
reinforcement  was  done  by  the  interviewer  saying  "good"  after  every  sentence 
using  "I"  or  "We."  High  social  desirability  subjects  responded  to  reinforce- 
ment by  increasing  equally  the  frequency  of  both  positive  and  negative  self- 
referent statements.  Low  social  desirability  subjects  did  not  condition, 
but continued to make more positive than negative self-references.


Effects  of  Anonymity 

Several  studies  have  been  carried  out  to  determine  the  effect  of 
anonymity  on  questionnaire  responses.  Pearlin  (1961)  carried  out  a study 
in which he found that people selecting anonymity in filling out a questionnaire
had different characteristics than those who did not. A questionnaire
was administered to the nursing force of a large Federal mental
hospital.  Respondents  were  given  the  option  of  anonymity.  It  was  found 
that  those  selecting  anonymity  were  no  more  negative  in  their  attitudes 
on a number of critical issues than were those who signed their questionnaires.
Anonymous respondents were more subject to feelings of incompetence
as  reflected  by  their  low  scores  on  a measure  of  self-regard,  by  their 
reluctance  to  voice  opinions  at  work,  and  by  their  reported  difficulty  in 
coping  with  the  questionnaire.  A second  distinguishing  characteristic  of 
the  anonymous  respondents  was  their  generally  cautious  view  of  people  about 
them and their motives. Finally, it was shown that the anonymous respondents
had less involvement and interest than signers in the issues covered
by the questionnaire. Based on these findings, Pearlin suggested that
anonymity in administration is useful, but for reasons other than the prevention
of the arousal of fear or threat.

A few  studies  provide  evidence  that  anonymity  is  affected  by  more 
than signing or not signing a questionnaire. Wiseman (1972) noted that
questionnaires  provide  more  anonymity  than  interviews.  Metzner  and  Mann 
(1952)  conducted  a study  in  which  a fixed  alternative  questionnaire  was 
compared  to  an  open-ended  interview  with  328  employees  in  an  electric 
utility  plant.  The  subjects  were  given  an  attitudinal  questionnaire  about 
their  job  with  five  scaled  responses  to  choose  from,  followed  two  months 
later  with  a personal  interview  asking  similar  questions.  The  respondent's 
anonymity  was  assured  both  times.  Blue  collar  workers  were  more  confident 
of  the  anonymity  of  the  questionnaires,  while  white  collar  workers  felt 
the  interviews  were  not  less  anonymous  than  the  questionnaires.  In 
general,  the  interviews  yielded  higher  proportions  of  satisfied  responses 
than  the  questionnaires. 

Knudsen,  Pope,  and  Irish  (1967)  collected  data  from  samples  of  pre- 
maritally  pregnant  white  women  by  three  methods.  The  first  sample  anony- 
mously completed  questionnaires  in  their  physician's  office,  the  second 
sample  was  interviewed  confidentially,  and  the  third  filled  out  question- 
naires in  the  presence  of  an  interviewer.  Data  suggested  that  in  the  inter- 
view situations  the  respondent  was  more  likely  to  support  the  public  and 
restrictive  sexual  norms  that  she  assumed  were  adhered  to  by  the  interview- 
er. Lower  socioeconomic  respondents  deferred  to  the  norms  represented  by 
the  higher  status  interviewers.  In  the  private  and  anonymous  questionnaire 
situation,  the  respondents  more  often  answered  to  subcultural  norms. 

Simon (1967), in his article on personally typed cover letters versus
mimeographed  form  cover  letters  in  mail  surveys,  advanced  the  possibility 
that  in  some  cases  the  personally  typed  cover  letters  may  reflect  a lack 
of  anonymity  to  respondents  even  though  it  is  assured  because  the  letters 
are  addressed  to  them  personally. 

Hamel and Reif (1952) explored the question of differences in response
due  to  signing  or  not  signing  the  Employee  Attitude  Questionnaire.  They 
found  that  essentially  the  same  responses  were  obtained  for  individuals  in 
identified  or  anonymous  groups.  They  speculated,  however,  that  these 
results  may  have  been  influenced  by  the  fact  that  the  staff  of  a university 
organization  administered  the  questionnaires  and  respondents  were  repeat- 
edly assured  that  the  questionnaire  would  only  be  used  for  confidential 
research  purposes. 

Dunnette  and  Heneman  (1956)  investigated  the  effects  on  attitude 
responses  of  the  identity  of  the  survey  administrator.  Two  employee 
samples  were  selected  randomly  from  the  total  work  force  of  a large  de- 
partment store.  The  IRC  Employee  Attitude  Questionnaire  was  administered 
to  one  group  by  an  Industrial  Relations  Center  staff  member  and  to  the 
second  group  by  the  personnel  manager  of  the  store.  The  group  which  was 
given  the  questionnaire  by  the  manager  responded  more  favorably  to  the 
attitude  survey  than  the  other  group.  The  same  group  tended  to  give  few- 
er and  shorter  responses  to  open-end  questions  than  the  employees  who  were 
given the questionnaire by the Industrial Relations Center staff member.

Klein, Maher, and Dunnington (1967) compared attitude survey responses
between identified and nonidentified manufacturing employees made under
two conditions of identification. One condition involved a face-to-face
designation by the respondent's manager as to which group he was to be in
(high threat), and the other involved a random allocation as the respondent
entered the testing room (low threat). All subjects were assured of the
confidentiality of their responses, and the nonidentified respondents were
assured anonymity. A positive distortion in responses took place under both
identified conditions, but significantly more under high threat.

Bartlett and Sharon (1969) determined the effects of several
instructional rating conditions on leniency on a graphic and a
forced choice rating scale. Approximately 1,000 undergraduate psychology
students rated their instructors under instructional conditions indicating
that the ratings: will be anonymous and will be used for research purposes
only; may be used for evaluation purposes; will be identified by having
the  rater  place  his  name  on  the  rating  form;  or  will  have  to  be  explained 
to  the  ratee  by  the  rater.  A significant  leniency  effect  was  found  with 
the  graphic  ratings  which  were  to  be  used  for  evaluation  purposes  and 
those  that  had  to  be  justified  to  the  ratee.  It  was  concluded  that  the 
forced  choice  scale  was  quite  resistant  to  leniency  bias,  however. 

Some  investigators  found  no  differences  in  anonymous  versus  non- 
anonymous  conditions.  Edwards  (1957a)  found  that  assurance  of  anonymity 
did  not  eliminate  or  drastically  change  the  nature  of  the  relationship  pre- 
viously found  between  probability  of  endorsement  and  social  desirability 
scale  value  where  the  assessments  were  not  made  anonymously.  Ash  and 
Abramson  (1952)  concluded  that  the  verbally  expressed  attitudes  of  college 
students,  as  recorded  on  scales  relating  to  ethnocentrism,  political- 
economic  conservatism,  and  anti-Negro  prejudice,  were  not  biased  in  either 
a more  'pro'  or  more  'anti'  direction  as  a result  of  the  requirement  that 
they sign the scales, thus identifying themselves. Gerberich and Mason
(1948)  found,  in  the  administration  of  a questionnaire  on  academic  back- 
ground, plans,  and  study  habits  to  2,876  students  taking  a biological 
science  course,  that  there  were  no  significant  differences  between  signed 
and  unsigned  questionnaires.  As  mentioned  above,  Hamel  and  Reif  (1952) 
also  found  no  differences  in  signed  versus  unsigned  questionnaires.  But 
Corey  (1937)  found,  in  the  administration  of  a questionnaire  on  cheating, 
that mean scores reflected a slight but statistically insignificant
tendency  for  more  sympathetic  attitudes  toward  cheating  to  be  expressed 
on  anonymous  papers. 

Some  studies  have  indicated  that  different  results  were  obtained  in 
anonymous  versus  nonanonymous  situations.  Fischer  (1946)  gave  a psycho- 
logical problem  checklist  to  102  female  psychology  students,  first  with 
signatures  required,  then  a week  later  without  signatures  required.  The 
results  indicated  that  the  mean  number  of  problems  listed  did  not  vary 
significantly  under  the  two  conditions,  but  the  mean  number  of  serious 
problems  listed  tended  to  be  significantly  greater  when  signatures  were 
not  required.  In  a study  conducted  by  Olson  (1936)  a personality  test  to 
measure emotional instability was given to two comparable groups of
college women, one group remaining anonymous, the other group signing
their  names.  Subjects  reported  significantly  more  feelings  and  symptoms 
with  neurotic  implications  under  anonymous  conditions  than  when  required 
to  sign  their  names. 

The effects of anonymous versus nonanonymous data collection seem
to be related to the content of the data. Wiseman (1972) conducted a
public opinion poll on current social issues using three samples of
Boston  households  controlled  for  socioeconomic  factors.  Data  were  col- 
lected from  one  sample  by  mailed  questionnaires,  from  the  second  by 
telephone  interview,  and  from  the  third  by  personal  interview.  For  eight 
of  the  ten  questions,  the  results  under  the  three  conditions  were 
similar;  but  for  two  questions,  one  on  contraception  and  one  on  abortion, 
the  results  differed  significantly.  On  the  anonymous  questionnaire  more 
people seemed to be in favor of such programs than in either the telephone
or  personal  interview.  Wiseman  concluded  that  sensitive  issues  involving 
socially  accepted  or  rejected  answers  will  effect  more  response  bias  in 
interviews  than  in  questionnaires. 

In  the  Klein,  Maher,  and  Dunnington  (1967)  study  described  above, 
items  themselves  produced  variable  distortion.  Items  dealing  with  salary 
and  with  ratings  of  top  management  produced  consistent  positive  distortions 
under  identified  conditions,  whereas  items  dealing  with  work  pressure  and 
the  subject's  manager  produced  little  or  no  distortion.  Dunnette  and 
Heneman  (1956)  also  found  that  the  amount  of  response  distortion  depended 
upon the content of the items comprising the questionnaires.

Rosen  (1960),  as  a result  of  a study  with  college  freshmen  completing 
signed  or  unsigned  questionnaires  on  the  effectiveness  of  a reading  program, 
concluded  that,  when  respondent  identification  is  essential  for  correlation- 
al or  followup  purposes,  the  straight-forward  approach  is  preferable  to  a 
number  coding  system.  For  sensitive  issues  or  where  there  is  expected 
distortion,  it  may  be  advisable  to  use  an  anonymous  questionnaire.  The 
other  articles  discussed  above  appear  to  support  Rosen's  recommendation. 

In  summary,  it  appears  that  anonymity  depends  not  only  on  unsigned 
questionnaires  but  also  on  the  conditions  under  which  the  questionnaires 
are  administered.  In  addition,  it  appears  anonymity  only  makes  a difference 
when  information  on  sensitive  areas  is  collected. 


Effects of Administration Time


Most of the studies conducted to evaluate the effects of administration
time were done using achievement and performance tests, which are
beyond the scope of this study. The data were not pertinent to this review
as they were based on the number of right answers or total individual
scores. The related topic of questionnaire length is discussed in Chapter
VIII.


Miron  (1961)  in  a study  using  the  semantic  differential  did  vary 
directions in terms of how much time the subject was to take to respond to
an item. One group was instructed to mark all items at a fairly rapid
pace and to attempt to recall and duplicate their markings on the immediate
retest. A second group was instructed to proceed at a slow rate
throughout the testing and to recall and duplicate markings on the retest.
A third group was instructed not to try to recall original testing judgments
and to proceed rapidly on the retest. The fourth group was instructed
not to recall but to proceed slowly. Test-retest correlations were computed
for  each  of  the  groups  and  were  all  uniformly  high.  The  standard  errors 
of  substitutions  were  found  to  range  between  .24  and  .32  for  groups  one  and 
three  respectively,  with  an  average  absolute  deviation  range  of  .10  to  .14. 
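
The two reliability summaries mentioned above, a test-retest correlation and an average absolute deviation between test and retest markings, can be sketched as follows. The seven-step ratings are invented for illustration and are not Miron's data; the function names are likewise hypothetical, and the Pearson coefficient is assumed here since the review does not say which correlation was computed.

```python
import math


def pearson_r(first, second):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    cross = sum((a - mean_first) * (b - mean_second)
                for a, b in zip(first, second))
    ss_first = sum((a - mean_first) ** 2 for a in first)
    ss_second = sum((b - mean_second) ** 2 for b in second)
    return cross / math.sqrt(ss_first * ss_second)


def mean_absolute_deviation(first, second):
    """Average absolute difference between paired test and retest markings."""
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)


# Invented markings on a seven-step semantic differential scale.
test_marks = [1, 3, 5, 7, 4, 2, 6]
retest_marks = [1, 3, 5, 6, 4, 2, 6]

print(round(pearson_r(test_marks, retest_marks), 3))
print(round(mean_absolute_deviation(test_marks, retest_marks), 3))
```

A single marking shifted by one scale step leaves the correlation high while still producing a nonzero average deviation, which is why the two figures are usefully reported together.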

It appears from the lack of studies in this section that further
research is needed on the effects of administration time on subjects'
motivation and on the effects of setting time limits for completing
questionnaires.


Effects  of  Characteristics  of  Questionnaire  Administrators 


This  section  reviews  the  effects  that  certain  characteristics  of 
questionnaire  administrators  have  on  the  responses  received  from  the  people 
completing the questionnaire. Some of the studies dealing with interviewers
appear to be generalizable to questionnaire administrators.

Sex  of  the  interviewer.  Colombotos,  Elinson,  and  Loewenstein  (1969) 
studied  the  effect  of  the  interviewer's  sex  on  interview  responses.  They 
found  essentially  no  difference  in  the  reporting  of  psychiatric  symptoms 
to  male  and  female  interviewers  in  a community  survey.  However,  they  did 
speculate  that  differences  in  response  patterns  according  to  the  interview- 
er's sex  may  depend  on  subject  matter  as  well  as  on  the  composition  of  the 
respondent  populations  and  other  characteristics  of  the  specific  survey 
situation.  They  recommended  that  the  rationales  commonly  presented  for 
having  either  male  or  female  interviewers  be  critically  reexamined.  Thumin 
(1962)  found  that  the  percent  of  people  admitting  to  insomnia  differed  sig- 
nificantly according  to  the  sex  of  the  interviewer.  Twenty-two  percent  of 
the  subjects  interviewed  by  male  interviewers  reported  having  insomnia, 
compared  to  13%  of  the  subjects  interviewed  by  females.  No  interaction 
effects between sex of interviewer and sex of subject were found. In a
study conducted by Boyd and Westfall (1965a), it was found that women had
better  ratings  as  interviewers  than  men. 

Race  of  the  administrator.  Most  of  the  studies  conducted  to  determine 
if the race of the investigator had an effect on responses involved questionnaires
concerned with race. In a study conducted by Summers and
Hammonds  (1966)  a Negro  attitude  scale  was  administered  by  two  investigators. 
In  a portion  of  the  groups  both  investigators  were  white.  In  the  remainder 
of  the  groups  there  was  a Negro  and  a white  investigator.  The  results  in- 
dicated that  socially  acceptable  answers  to  the  Negro  attitude  scale  were 
reported  with  greater  frequency  when  one  of  the  investigators  was  Negro. 
However, the phenomenon was more pronounced among certain strata of respondents
than among others, suggesting that the effects should be viewed as
the result of interaction between investigator and respondent characteristics.
Sedlacek and Brooks (1972) similarly measured the attitudes of
whites toward blacks with the Situational Attitude Scale. Their results
indicated, however, that there were no measurable effects attributable to the
race of the person administering the scale.

Sattler (1970) reported on a comparative review of studies that considered
the effects of the experimenter's/interviewer's race on physiological
responses, task performance, intelligence testing, personality scores,
attitudes  and  preferences,  speech  patterns,  interviewing,  and  psychothera- 
peutic or  counseling  relationships.  One  finding  reported  was  that  respond- 
ents more  often  gave  socially  desirable  answers  to  interviewers  whose  race 
was  different  from  theirs,  particularly  if  their  social  status  was  lower 
than  that  of  the  interviewer  and  the  topic  of  the  question  was  threatening. 

Trent  (1954)  hoped  to  find  an  effect  of  the  investigator's  race  on 
responses  by  asking  black  and  white  kindergarten  children  to  select  their 
mother  from  three  photographs  of  models  (one  black,  one  brown,  and  one  white) 
using either a white or black investigator. He found:

1.  When  testing  white  children,  a white  investigator  produced
more  responses  to  white  and  brown  mothers;  but  with  a black
investigator,  the  choices  were  for  a black  or  white  mother.

2.  The  results  differed  little  with  black  youngsters  except  that 
with  the  black  investigator  they  chose  brown  or  black  mothers 
more  often. 

3.  The  black  children  avoided  making  any  decision  25%  of  the 
time  with  a white  investigator  but  not  at  all  with  a black 
investigator,  while  there  were  no  evasions  by  the  white 
children  in  either  condition. 

The  results  of  the  study  by  Schuman  and  Converse  (1971)  indicated 
that  the  race  of  the  interviewer  affected  only  responses  to  questions 
about  militancy  and  hostility  toward  whites,  and  not  responses  to  non- 
racial  questions  or  questions  about  discrimination.  Black  interviewers 
obtained  more  militant  answers,  particularly  from  lower  socioeconomic 
status  black  respondents.  White  reports  to  white  interviewers  indicated 
a higher  level  of  militancy  for  upper  than  lower  income  blacks,  while 
black  interviewers  received  a fairly  even  distribution.  Schuman  and 
Converse  attributed  this  difference  to  a "white  effect"  rather  than  to  a 
"black  effect."  One  other  question  on  favorite  entertainers  showed  dif- 
ferences by  race  of  interviewer,  indicating  that  the  interviewers'  race 
can  establish  different  frames  of  reference  even  in  nonsensitive  areas. 

In a study by Baratz (1967) 120 Negro undergraduates were administered
the Test Anxiety Questionnaire under eight experimental conditions.
It was found that Negro subjects tested by a Negro examiner reported less
anxiety than those tested by a white examiner.

Other  characteristics  of  administrators.  Ehrlich  and  Riesman  (1961) 
investigated  the  tendency  of  teen-aged  female  students  to  give  socially 
desirable  answers  to  authority  figures.  A socially  desirable  answer  was 
defined as showing stronger ties to parents or other adults than to peers.
Less socially desirable answers were more often given to younger inter-
viewers and  to  more  flexible  and  less  authoritative  interviewers  (as 
judged  by  personality  scores).  However,  for  interviewers  over  53,  person- 
ality did  not  make  a difference  as  people  over  53  were  seen  as  authority 
figures  regardless  of  their  personality  characteristics.  Respondents 
under  16  did  not  show  as  clear  a differential  by  age  of  interviewer  as 
did  those  between  16  and  18.  Sattler  (1970)  reported  that  the  greater  the 
disparity  between  the  status  of  the  interviewer  and  that  of  the  respondent, 
the greater the tendency for biased responses. And Siegman, Pope, and Blass
(1969)  found  that  more  productive  responses  were  elicited  by  high  than  low 
status  interviewers. 

Atkin  and  Chaffee  (1972)  tested  the  ingratiating  effect  that  an  inter- 
viewer may  have  over  a subject.  In  one  study  residents  were  asked  about 
their  opinions  about  firefighters.  Half  of  the  subjects  were  told  their 
interviewers  were  firemen  while  the  other  half  believed  they  were  only 
students.  In  another  related  study  mothers  were  asked  their  opinions  of 
violence  on  TV.  Half  were  told  their  interviewers  were  on  a Federal  com- 
mittee investigating  TV  violence,  while  the  other  half  were  told  their 
interviewers  were  students.  The  results  showed  significant  differences 
between  the  two  groups  in  each  study,  which  suggests  a subject  will  try 
to  answer  favorably  in  the  eyes  of  the  interviewer,  if  the  subject  can  de- 
termine some  means  of  response  bias. 

Quinn (1967), using military officers, examined the hypothesis that performance
raters would tend to give higher ratings to subordinates who were most like
themselves. Results indicated that there was no evidence that performance
ratings were influenced by similar characteristics of rater and ratee.
Johnson  (1958)  had  company  interviewers  rate  job  applicants,  and  the  appli- 
cants rate  the  interviewer.  He  concluded  that  personnel  selection  is 
largely  a matter  of  harmony  of  personal  characteristics  between  the  inter- 
viewer and  the  interviewee. 

Some  studies  dealt  with  the  experience  of  the  interviewer  related  to 
the  responses  received.  Smith  and  Hyman  (1950-51)  found  that  interviewers 
with  more  than  a year  of  experience  made  fewer  errors  in  recording  data 
than  those  with  no  interviewing  experience.  Schyberger  (1967)  reported  that 
the  results  of  a study  showed  nonsignificant  differences  between  interview 
completion  rates  for  experienced  and  inexperienced  interviewers,  and  that 
the  training  and  experience  of  the  interviewer  had  no  effect  on  the  number 
of  deviations  they  made  from  the  instructions.  Boyd  and  Westfall  (1965a) 
reported  that  all  interviewers  improved  with  experience,  and  training  im- 
proved interviewers  with  a high  school  education  but  had  little  effect  up- 
on interviewers  with  a college  education. 

More  research  needs  to  be  done  on  the  characteristics  of  the  admini- 
strators of  questionnaires.  No  studies  were  uncovered  related  specifically 
to  military  personnel.  For  example,  the  military  rank  of  the  person  ad- 
ministering a questionnaire  may  have  an  effect,  as  might  whether  the  ad- 
ministrator is  in  the  military  or  not. 


Effects  of  Administration  Conditions 


The effects of questionnaire administration conditions were studied
by a number of authors. In a study conducted by Hinrichs and Gatewood
(1967),  male  technical  employees  in  a large  national  organization  rated 
their  degree  of  satisfaction  or  dissatisfaction  with  various  aspects  of 
their  work.  It  was  found  that  conditions  under  which  the  survey  was  ad- 
ministered did  have  an  effect  on  response.  When  employees  were  surveyed 
on  their  job  location  under  the  supervision  of  a company  representative, 
there was a tendency to respond more favorably to a significant number of
general  opinion  questions,  particularly  questions  dealing  with  the  company 
in  general,  than  when  they  were  permitted  to  respond  to  a questionnaire 
mailed  to  their  home. 

Green  (1951)  determined  that  a larger  percentage  of  people  attempted 
to  fake  on  the  Kuder  Preference  Record,  the  Guilford  Inventory  of  Factors, 
and  the  Guilford-Martin  Inventory  of  Factors  when  these  tests  were  used 
for selection purposes than when they were administered to a control group.
Rainio  (1956)  also  studied  the  effect  of  the  selection  situation  on  re- 
sponses to  questionnaires.  His  experiment  showed  a significant  difference 
between  research  and  selection  situations  for  various  traits.  In  the 
selection  situation  there  was  a trend  to  higher  scores  on  those  variables 
shown  to  have  higher  correlations  with  the  criterion.  Heron  (1956)  design- 
ed an  experiment  to  measure  the  effect  of  differences  in  test  conditions. 
Applicants for the job of omnibus conductor were given a two-part personality
test covering emotional maladjustment and sociability. In one case
it  was  administered  as  part  of  the  application  process  along  with  a health 
examination.  In  another  situation  the  test  was  administered  to  individuals 
after  they  had  been  hired.  There  was  a statistically  significant  differ- 
ence between  the  mean  scores  and  variances.  The  group  administered  the 
questionnaire as part of the application process had a higher mean score
and  greater  variance. 

Several  of  the  studies  on  the  effects  of  questionnaire  administration 
conditions  were  concerned  with  raters  and  their  performance.  Bayroff, 
Haggerty,  and  Rundquist  (1954)  studied  the  validity  of  rating  related  to 
rating  techniques.  Officer  students  served  as  a rater-ratee  population 
using  two  types  of  graphic  rating  scales  and  two  modifications  of  the 
forced  choice  technique.  Results  indicated  that  ratings  earlier  in  a 
series  were  more  valid  than  those  at  the  end. 

Freeberg  (1969)  studied  the  relevance  of  rater-ratee  acquaintance 
in  the  validity  and  reliability  of  ratings.  Unacquainted  subjects  worked 
in  three-man  groups  under  relevant  and  irrelevant  acquaintance  conditions. 
The  subjects  rated  one  another  on  scales  that  defined  several  cognitive 
skills.  They  were  also  rated  on  these  same  scales  by  observers  who  were 
dependent  on  visual  information  only  and  were  unacquainted  with  the  group 
members  or  the  nature  of  the  task  being  performed.  Group  members  under 
the  relevant  acquaintance  condition  achieved  consistently  good  validity 
ratings  for  all  three  cognitive  areas,  with  the  best  validity  rating  on 
mathematical  ability.  Validity  under  the  irrelevant  acquaintance  condition 
was  nil  on  all  scales.  Observers  achieved  significant  validity  (although 
at  lower  levels  than  participating  group  members)  only  for  ratings  under 
the  relevant  acquaintance  condition. 


Shen  (1925)  found  that  when  28  subjects  ranked  each  other  on  friend- 
ship and  eight  other  traits  (intellectual  quickness,  intellectual  profound- 
ness, memory,  impulsiveness,  adaptability,  persistence,  leadership,  and 
scholarship)  there  was  a tendency  to  overestimate  friends  on  all  traits 
except impulsiveness and to underestimate those rated as less intimate.
Mayo (1956) concluded that for peer rating there is a substantial halo
effect.  This  conclusion  was  based  on  a study  of  peer  ratings  of  intelli- 
gence and  effort  with  objective  measures  of  both. 

It  is  apparent  that  additional  research  is  needed  on  the  effect  of 
administration  conditions.  Such  research  should  include  the  study  of 
fatigue  factors. 


Effects  of  Other  Factors  Related  to  Questionnaire  Administration 

Many things affect the data collected by questionnaires, raters, interviews,
and observers. This section discusses those uncovered by the review
of the literature: investigator bias, question bias, observer bias, halo
effects, and the biasing effect of interviews.

Investigator bias. One of the main sources of bias is the
researcher himself. Kornhauser (1947) discussed this problem of bias in
research. He identified several biases: choice of subject matter; study
design and procedure; unfair or loaded phrasing of questions; and interpretation
and reporting of results. He felt the sources of such biases are
the researcher's relationship with the client, the researcher's personal
involvement in a particular theoretical position or research technique, and
those personal traits attributable to class, race, and political ideology.
To reduce the impact of bias he felt that researchers need to be aware of
such problems, seek critiques from independent sources, pursue
public scrutiny through publication of reports, and continue to pursue
technical improvement in opinion research.

Factors of which the researcher is not aware can also influence
results. Jensen and Schmitt (1970) designed a study to determine the extent to
which responses to test items of the type frequently found in personality
inventories would be influenced by the title associated with the test.
An instrument was constructed and administered to eight treatment groups.
Each administration differed primarily in the title the test bore. The
dependent variables were measures of the tendency to lie, respond defensively,
answer carefully, and complete questions. Subjects tended to lie
and respond more defensively to titled tests than to a test having no
title and administered under nonthreatening conditions. All other comparisons
were not statistically significant.

Dillehay  and  Jernigan  (1970)  tested  the  hypothesis  that  biased  ques- 
tionnaires are  effective  in  inducing  changes  in  the  subsequent  opinions  of 
respondents.  Systematically  biased  and  control  questionnaires  were  con- 
structed in  a manner  designed  to  elicit  either  harsh,  lenient,  or  neutral 
opinions  of  respondents  concerning  the  treatment  of  criminals.  After 
answering one form of these treatment questionnaires, respondents
registered their opinions on standardized attitude scales. The results
indicated  that  the  treatment  questionnaires  were  successful  in  manipulating 
responses  to  lenient  bias.  Subjects  displayed  more  lenient  attitudes  after 
exposure  to  the  lenient  form  than  after  exposure  to  either  the  neutral  or 
harsh  forms  of  the  questionnaire. 

Question  bias.  Suchman  and  Guttman  (1947)  gave  four  suggestions  for 
eliminating question "bias": asking many questions on the same topic;
determining  by  scale  analysis  whether  questions  ask  the  respondents  about 
the  same  dimensions  of  opinion;  asking  "How  strongly  do  you  feel  about 
this?"  after  each  opinion  question;  and  relating  the  content  of  opinion 
to  intensity  of  feeling. 

Observer  bias.  O'Leary  (1973)  and  Skindrud  (1972)  studied  observer 
bias  in  field  studies.  O'Leary  found  that  simply  informing  observers  of 
experimental  hypotheses  did  not  produce  observational  data  consonant  with 
those  hypotheses.  However,  questionnaire  responses  following  an  experiment 
with  different  induced  expectations  did  produce  global  data  consonant  with 
the experimental hypotheses. He also found that, if observers are informed of
the  experimental  hypotheses  and  the  investigator  provides  daily  feedback 
to  them  indicating  how  well  their  data  support  his  hypotheses,  observers 
will  report  data  consonant  with  those  hypotheses.  Skindrud  (1972)  led 
three  groups  of  observers  to  expect  different  outcomes  from  their  observa- 
tions. Even  though  the  groups  expected  different  outcomes,  they  were 
totally  unbiased  in  their  reports  of  deviant  behavior  in  group  comparisons. 
Failure  to  obtain  evidence  for  observer  bias  in  spite  of  the  demonstrated 
manipulation  of  observer  expectations  was  attributed  to  the  precautions 
taken  to  assure  high  levels  of  observer  accuracy. 

Halo  effects.  Several  studies  discussed  the  halo  effect,  which  is 
the  tendency  for  trait  ratings  to  reflect  in  part  the  rater's  general  im- 
pression of  the  person  he  is  rating.  Bingham  (1939)  reviewed  the  results 
from  two  examining  boards  responsible  for  rating  29  candidates  for 
executive  director  positions  in  two  Pennsylvania  counties.  He  found  the 
correlations between ratings for the general category "Personal Fitness"
and  the  ratings  for  specific  traits  such  as  voice,  poise,  freedom  from 
bias,  and  ability  to  plan  and  organize  to  be  positive  and  rather  high. 
Johnson  and  Vidulich  (1956)  found  that  halo  effect  is  a judgmental  error 
rather  than  the  effect  of  an  objective  correlation  of  traits.  In  their 
study  one  group  rated  five  individuals,  one  individual  per  day  on  five 
traits,  while  another  group  rated  five  individuals  on  one  trait  per  day. 
Johnson (1963) reanalyzed the data and found that the usual interaction
between raters and individuals was significant under both
experimental conditions. Hence he concluded that the evidence for halo
effect due to judging operations remains questionable. Bucklow (1960)
concluded that, if items are constructed so "as to relate to clearly
observable aspects of behavior which do not overlap," rating will be improved,
although "halo" cannot be eliminated.

In a study conducted by Gordon (1972), the results indicated that
neither the differential accuracy phenomenon (the situation where correct
behavior is identified more accurately than incorrect behavior) nor the
overall accuracy of ratings was related to the favorability of the
rater's general impression of the ratee. He concluded that these findings
make suspect the current practice of operationalizing leniency error by
use of the average level of favorability of global rating. Bayroff,
Haggerty, and Rundquist (1954) found that the average of a number of
ratings was more valid than a single rating per ratee. Rappard (1950)
likewise reported that mutual arrangement between a number of raters is
felt to enhance greatly the correctness of the rating.

Zavalloni  and  Cook  (1965)  concluded  that  ratings  of  unfavorable  as  well  as 
neutral  items  are  influenced  by  raters'  attitudes.  Extreme  judges  make 
fine  discriminations  at  their  own  end  of  the  scale  and  lump  together  the 
items  at  the  other  end.  Falk  and  Bayroff  (1954)  concluded  that  the  rater 
is  the  principal  source  of  contamination  in  studies  using  ratings. 


Biasing effect of interviews. Many studies have been conducted to
show the biasing effect of the interview. Since unstructured interviewing
is not within the focus of this review, only a few of these studies are
discussed below to indicate the scope of the bias and some of the recommendations
for controlling it.


In  a study  by  Stanton  and  Baker  (1942),  five  professionally  trained 
interviewers  obtained  significantly  more  correct  recognitions  of  previous- 
ly exposed  geometric  figures  when  they  knew  the  identity  of  the  correct 
figures  than  when  they  did  not.  In  contrast,  in  a study  by  Lindzey  (1951) 
graduate  students  with  training  in  interviewing  methods  failed  to  obtain 
significantly  more  correct  recognition  of  previously  exposed  geometric 
figures  when  they  knew  the  identities  of  the  correct  figures  than  when 
they  did  not. 

Hanson and Marks (1958) reported that the factors leading to significant
effects of the interviewer upon results are: relatively high ambiguity
in the concept or wording of the inquiry; the interviewer's "resistance"
to a given question; and additional questioning or probing. Ferber and
Wales  (1952)  reported  that  interviewer  bias  could  exist  without  being 
apparent  in  an  analysis  of  overall  sample  distributions.  The  direction 
of  bias  did  not  appear  to  be  uniform.  Cahalan,  Tamulonis,  and  Verner 
(1947)  concluded  that  the  least  interviewer  bias  was  found  in  questions 
that could be answered "Yes" or "No." Shapiro and Eberhart (1947) reported
that interviewer bias on attitude questions resulted from differences
in  the  interviewing  method  used,  differences  in  the  degree  of  success  in 
eliciting  factual  information,  and  differences  in  classifying  the  respond- 
ent's answers. 


Back,  Hill,  and  Stycos  (1955),  by  analyzing  the  data  reported  from 
interviews  in  a fertility  program  in  Puerto  Rico,  found  reproducibility 
differences  which  were  attributed  to  the  interviewer  and  not  to  a response 
set of the respondents to four Guttman scales. Two "traits" were found
among the interviewers which were negatively correlated: conscientiously
completing the questionnaire, and understanding the study. The result
is either a quality interviewer or a quantity interviewer; the choice
between them should depend on the type of data needed. Smith and Hyman (1950-51)
concluded that interviewer expectations had a more powerful effect on
the results (recording errors) than did the interviewer's ideological
preferences.




Chapter  X 


CHARACTERISTICS  OF  RESPONDENTS  THAT  INFLUENCE 
QUESTIONNAIRE  RESULTS 


This chapter discusses various types of response bias. Response
bias refers to the tendency of subjects to respond to questions in a
pattern or set regardless of the content of the question. One hundred
thirty-seven studies were searched, indicating that the subject is recognized
as an important aspect in questionnaire technology. These
studies are discussed in terms of: item format biases; social desirability
response set; acquiescence response set; extreme response set;
the effects of attitudes on responses; and the effects of demographic
characteristics on responses.

Cronbach (1950) and Horn and Cattell (1965) examined the disturbing
effect of response bias on test reliability and validity. Fricke (1957)
asserted that response bias could explain repeated findings that well-adjusted,
successful persons obtain more abnormal scores on the subtle
scales of the MMPI than maladjusted, unsuccessful persons. Rorer (1965),
however, concluded that if a bias to consistently respond in a particular
way exists, it would be eliminated by rewording the questions in the
opposite direction. His results indicated that response bias could be
attributed simply to the content of the question and therefore, as an intervening
variable, would have only minor influence.

Most  of  the  research  on  response  bias  does  substantiate  its  existence. 
Nunnally and Husek (1958) demonstrated response bias by substituting randomly
chosen foreign words for meaningful components of test items and then
measuring the predisposition of subjects to give particular answers to
these ambiguous questions. In a similar study McCord (1951) designed
questions that could not be answered factually or truthfully by saying
"yes," yet he found between 8% and 53% affirmative responses. Berg and
Rapaport (1959) eliminated the questions altogether and had their subjects
answer imagined questions; they found a great tendency among their respondents
to choose culturally valued expressions such as "yes," "true,"
and "agree." Other researchers such as Webster (1960) have found high
correlations  between  response  patterns  on  personality  inventories  and 
personality  measures  like  social  alienation  and  schizoid  functioning. 

Sudman  and  Bradburn  (1974)  have  examined  many  possible  sources  for  re- 
sponse bias  and  the  effect  this  variable  has  on  error  in  research.  The 
remainder  of  this  chapter  will  describe  the  studies  done  during  the  last 
twenty-five  years  in  identifying  the  possible  sources  of  response  bias. 


Item Format Biases


It  has  been  shown  that  response  bias  is  related  to  the  format  of  the 
question  and  the  methods  of  response  available.  Cronbach  (1946)  and 
Miklich (1966) have demonstrated how item ambiguity produces a recognizable
pattern of response. Tajfel (1959) found that even abstract stimuli
influence the physical attributes of items to be judged when no other
information is available. Zajonc and Nieuwenhuyse (1964), however, found
that  the  frequency  of  common  words  as  answers  played  only  a negligible 
role  in  response  bias.  Sax  and  Carr  (1962),  working  with  omnibus  and 
subdivided  test  formats,  and  O'Dell  (1962),  studying  the  sequence  of  items, 
concluded  that  these  format  variables  were  significantly  correlated  with 
response  bias. 

Aschal  (1958)  concluded  that  response  bias  can  be  expected  from  all 
close-ended  questionnaires  where  answers  must  be  selected  from  two  or 
more  fixed  choices.  Jackson  and  Minton  (1963)  found  that  the  forced 
choice  format  of  selecting  one  of  a pair  of  choices  could  eliminate  a 
massive  response  set  where  some  respondents  tend  to  check  many  items  from 
a list and others only a few. Considering true-false, multiple choice,
and card sorting methods, Van Der Veen, Howard, and Austria (1970) concluded
that  all  three  formats  were  relatively  free  of  response  bias;  but,  like 
Cataldo,  Johnson,  and  Kellstedt  (1970),  they  found  card  sorting  to  show 
the  least  effects  of  format  response  bias.  One  additional  response  bias, 
that  of  random  markings,  was  noted  by  Flanagan  (1955),  and  seemed  to  be 
related  to  the  motivation  of  the  subjects  when  they  had  no  reason  to  take 
the  test. 


Social  Desirability  Response  Set 

Evidence  has  been  found  that  a response  set  or  style  exists  according 
to  how  favorably  society  would  view  the  response.  This  type  of  social 
desirability  response  was  found  by  Rugg  and  Cantril  (1942)  to  be  so  power- 
ful that  subjects  would  not  tend  to  deviate  from  social  norms  in  their 
answers  even  though  their  behavior  denied  the  opinion.  Warren  (1972) 
successfully  trained  some  subjects  to  a particular  response  set  but  found 
that  highly  socially  desirable  items  prevented  facilitation.  In  an 
attempt  to  further  define  this  factor  of  response  bias,  Fehrer  and  Strupp 
(1949)  determined  that  prestige  value  had  no  effect  on  responses  to  job 
title preferences, and Krug and Northrup (1959) noted that on self-description
inventories response time decreased as social acceptability increased.

The  influence  of  social  desirability  was  noted  by  French  (1958)  in 
scaling  instructions  that  included  the  phrase  "the  Air  Force  way."  When 
a respondent's job (Green, 1951) or his incarceration (Dubeck et al., 1971)
depended  upon  his  answers,  there  was  a great  tendency  toward  socially 
desirable  responses.  Wiseman  (1972)  found  that  anonymous  questionnaires 
as  opposed  to  personal  interviews  were  necessary  in  order  to  surmount  the 
social  desirability  response  set  on  such  socially  sensitive  issues  as 
abortion  or  contraceptives.  Heilbrun  (1958)  noted  that  under  defensive 
conditions,  subjects  avoided  unfavorable  self-descriptive  adjectives  but 
did not necessarily increase selection of favorable adjectives.

Several authors (Edwards & Diers, 1963; Dixon, 1970; Potter &
Tinkleman, 1970; Eysenck & Eysenck, 1963; Brod, Kernoff, & Terwillinger,
1964) have identified subjects with a high social desirability response
rate. They found these respondents to give more true responses to neutral
items, to be more susceptible to manipulation by social pressure, to be more
likely introverts, and to score higher on a "lie" scale. Buss (1959)
found that this response set was elevated with some subjects when given
response choices styled like "trouble controlling," "must admit," and
"tempted."

Faking, or responding with socially desirable answers which are not
true, is a response error. Izard and Rosenberg (1958) gave instructions
to their treatment group to try to fake their answers but found no significant
differences between that group's responses and the control group's in a
forced choice test. Several other authors (Leftwich & Remmers, 1962; Eisenberg, 1965;
Bartlett & Doorley, 1967), however, obtained significant results showing
fakability present in forced choice tests under varying instructional sets.
Jones (1959) tried to neutralize faking by instructing subjects to do so
and then establishing correlations of reliability with other tests. However,
he was unable to achieve high correlations. Cliff (1968a) determined that
faking responses as well as candid ones were simple functions of meaning
space due to the great unanimity among the subjects concerning how to fake.
Edwards (1957a) noted that even anonymity failed to eliminate the social
desirability response set.

The forced choice instrument format has been studied for its susceptibility
to social desirability response style. Silverman (1957) and
Karr (1959a) found the forced choice method to minimize the effect of
social desirability, while Krug (1958), Howe (1960), and Bernhardson and
Fisher (1971) found that the factor needed control in forced choice tests.
Isard (1956) concluded that in forced choice formats, ambiguous items tended
to be freer of social desirability response set than positively or
negatively worded items. Because of the freer response choice of card sorting,
Edwards and Horst (1953) and Edwards (1955) examined the method for social
desirability effect but found it, too, needed controls to eliminate the
bias. Hillmer (1958) found this response set to operate whenever the subject
had the opportunity to respond in terms of it.

Krieger  (1964)  and  Smith  (1967)  have  both  developed  procedures  for 
controlling  or  balancing  social  desirability  by  using  loaded  items  in  the 
test  and  then  adjusting  the  subject's  score. 


Acquiescence  Response  Set 


The  response  set  to  consistently  agree,  to  say  "yes,"  or  to  say  "true," 
is  called  acquiescence.  Upon  comparing  subjects  taking  attitude  measuring 
tests, Lorge (1937) found correlations among those who marked "yes," "like,"
and "1" or "2" as well as correlations for those marking "no," "dislike,"
and "7" or "8." Shipley, Norris, and Roberts (1946) noted that judgement
time was decreased when subjects were to choose the most pleasant color
from many pleasant colors, or the most unpleasant one from many unpleasant
ones. They concluded that this was an indication of an acquiescent response
set. Other authors (Jackson & Messick, 1957; Mahler, 1962; Eysenck &
Eysenck, 1963a; Foster, 1961; Quinn, 1963) have identified the acquiescence
response set as a behavioral attitude to agree and accept even if subjects
must alter their original opinions to do so. Elliot (1961) determined
that acquiescence was highly dependent upon test construction and the respondent's
aptitude, while Hanley (1965) found that acquiescence occurred
with difficult rather than easy inventory material.

Gage and Chatterjee (1960) and Diers (1961) have further concluded
that there is an opposing response set -- the naysayer -- whom they found
more valid than the yeasayer. Wells (1963) identified both yeasayers and naysayers,
but found more distortion in survey findings due to the former.




A still  unsettled  argument  is  whether  or  not  acquiescence  is  simply 
a personality  trait.  Couch  and  Keniston  (1960),  Frederiksen  and  Messick 
(1958), Adams (1956), and Becker and Myers (1970) all pointed to the correlations
of personality factors with acquiescence. In contrast, Foster
and Grigg (1963), Eysenck (1962), and Findikyan (1969) found acquiescence
unrelated  to  personality  factors  in  personality  measures,  but  conceded 
there  may  be  a relationship  in  sociopolitical  opinions  or  attitudes.  To 
confound  the  matter,  Cohn  (1956)  contended  that  the  F scale  is  contaminated 
by  an  acquiescence  response  set,  while  Small  and  Campbell  (1960)  asserted 
that  the  relationship  between  conformity  and  the  F scale  is  a function  of 
content  and  not  acquiescence. 

Controls for acquiescence have been researched and some information
is available on the response set's effect. Wells (1961) has detailed
several  design  and  statistical  analysis  procedures  for  eliminating  the 
effect.  Clancy  and  Garsen  (1970)  found  that  absolute  scales  of  appeal 
were  distorted  by  yea-  and  naysaying  effects.  Banta  (1961)  and  Cloud  and 
Vaughn  (1970)  concluded  that  item  ambiguity  increases  an  acquiescent 
tendency,  but  that,  when  it  is  minimized,  balanced  keying  of  items  prevents 
contamination. Campbell, Siegman, and Rees (1967) found that positive-negative
reversal of items did not entirely eliminate the problem, but Findikyan (1969)
concluded that reversal is an effective control if the items are not awkwardly
worded. Falthzik and Jolson (1974) determined that a higher intensity
of agreement is reached when items are positively stated than negatively
stated. Knowles (1963), while finding the balancing of scales of dubious
value to counteract acquiescence, did demonstrate that true-false questionnaires
can be differentially prone to acquiescent response set.
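
The balanced keying and item reversal controls discussed above can be
illustrated with a short sketch. It is an illustration only and is not a
procedure taken from any of the studies cited: the 5-point agreement scale,
the item names, and the function score_balanced_scale are all hypothetical.
Half of the items are assumed to be worded in the reversed direction. The
content score is computed after reversing those items, while the raw mean
rating, taken before reversal, serves as a crude index of the tendency to
agree regardless of content.

```python
from statistics import mean

# Assumed 5-point agreement scale (1 = strongly disagree, 5 = strongly agree).
SCALE_MIN, SCALE_MAX = 1, 5


def reverse(rating):
    """Reverse-score a rating so that 5 becomes 1, 4 becomes 2, and so on."""
    return SCALE_MAX + SCALE_MIN - rating


def score_balanced_scale(responses, reversed_items):
    """Score a balanced scale (hypothetical helper).

    responses: dict mapping item id to the raw rating given.
    reversed_items: set of item ids worded in the reversed direction.
    Returns (content_score, acquiescence_index).
    """
    keyed = [
        reverse(rating) if item in reversed_items else rating
        for item, rating in responses.items()
    ]
    content_score = mean(keyed)
    # Raw agreement before reversal: high values on a balanced scale
    # suggest agreement regardless of item direction.
    acquiescence_index = mean(list(responses.values()))
    return content_score, acquiescence_index


# A respondent who agrees with every item, whatever its direction.
yeasayer = {"q1": 5, "q2": 5, "q3": 5, "q4": 5}
print(score_balanced_scale(yeasayer, {"q3", "q4"}))
```

On a balanced scale the indiscriminate agreement of such a respondent cancels
out of the content score and appears only in the raw agreement index. As the
studies above note, this control depends on the reversed items not being
awkwardly worded.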

There  is  a concern  that  social  desirability  and  acquiescence  are  re- 
lated in  such  a way  that  an  individual  with  a tendency  toward  conformity 
will consistently reflect both biases. Several authors (Schultz, 1962;
Stricker, 1962, 1963; Gloye, 1964; Liberty, 1965) have studied the relationship
of the two effects and found no correlation. In two additional investigations,
the two variables were studied but only one of them could be
established to exist independently: Siller and Chipman (1963) found an
acquiescent  response  set  factor,  and  Winters  and  Bartlett  (1966)  found  only 
an  independent  social  desirability  response  set. 

Extreme  Response  Set 

Several  studies  have  examined  the  possibility  that  an  extreme  response 
set  exists  where  some  individuals  tend  to  consistently  select  exaggerated 
choices  for  positions.  Rundquist  (1950)  found  a low  but  significant 
correlation for preferred personality and interest items. Rather than
use  it  as  a predictive  instrument,  the  author  suggested  attempts  be  made 
to  eliminate  the  effect.  Goldsamt  (1972)  found  evidence  that  an  extreme 
response  style  did  exist  on  neutral  content  items,  but  that  the  effect  was 
not  generalizable  due  to  its  interaction  with  content-extreme-response- 
style.  Mascaro  (1968)  examined  the  width  of  the  response  categories  and 
extreme  response,  but  found  no  significant  correlation. 

An  unnecessary  assumption  is  often  made  that  a "don't  know"  or 
middle  category  response  is  lacking  in  demonstrative  extremeness.  Worthy 
(1969)  and  Ziller  and  Long  (1965)  presented  evidence  that  this  response  is 
valid  and  can  be  related  to  dogmatism  as  a status-defense  mechanism.  Two 
studies  (Adams,  1956;  Cooper  & Cowen,  1962)  pointed  out  that  extreme  re- 
sponses are  not  necessarily  inhibited,  and  that  a lack  of  inhibition  would 
not  explain  the  bias  pattern. 

Soueif  (1958)  found  a positive  correlation  between  extreme  response 
style and intolerance to ambiguity. Levy-Leboyer (1955b), however, found
that  subjects  who  consistently  omitted  items  were  affected  more  by  motiva- 
tion than  by  a persisting  psychological  trait.  Lucky  and  Grigg  (1964) 
examined  defensiveness  and  deviant  responses,  but  concluded  that  outside 
self-description the two variables were unrelated.


Effects  of  Attitudes  on  Responses 

A response  bias  attributed  to  an  attitude  is  one  which  is  influenced 
by  the  respondent's  opinion,  belief,  or  position.  Shen  (1925)  recognized 
the  disturbing  influence  that  acquaintance  had  on  raters  and  ratees. 
Hinckley (1932a), Prothro (1955), and Ferguson (1935a) found that by using
Thurstone's  methods  of  equal  interval  scaling,  judges  could  rate  items 
without  being  influenced  by  their  own  attitudes.  Bruvold  (1971),  however, 
confirmed  a competing  hypothesis,  that  anti-attitude  judges  would  rate  un- 
favorable items  higher  and  favorable  items  lower  than  pro-attitude  judges. 
Other  similarly  disturbing  attitude  effects  in  response  were  found  by 
Prothro  (1957)  concerning  personal  involvement  of  judges,  and  by  Mogar 
(1960)  involving  high  authoritarians  in  controversial  social  issues. 

Explanations  for  these  contrasting  research  conclusions  came  generally 
as controls developed. Kendall (1954) found that unstable or changing responses
were due in part to shifts in the mood of the respondent, relative
values among the possible choices, and the degree of interest present in
the question. Kelley, Hovland, Schwartz, and Abelson (1955) found that
blacks and whites in a competitive situation would make similar judgements
concerning the social position of Negroes but, when separated, blacks tended
toward extreme responses. Zimbardo (1960) found no differences between
pro-  and  anti-judges  when  well-structured  sentences  were  used;  but,  as  they 
became  more  ambiguous,  the  responses  became  more  attitudinally  biased. 
Upshaw  (1962)  noted  that,  if  the  judge's  own  position  was  outside  the 
range  of  responses,  bias  would  be  evident.  Ramlo  (1968)  was  able  to  shift 
judges'  responses  through  attitudinal  bias  by  instructing  them  to  disregard 
their  own  opinions,  which  they  could  not  do. 
Other factors evidenced in attitudinal bias research are ego involvement
(Pauli, 19b8), prestige value (Doncel, Alimena, and Birch, 19A9),
lack of attitude or position (Georgoff, Hersker, and Murdick, 1972), and
task avoidance behavior (Weitman, 1964). In addition, Hill (1953) noted
that  inconsistent  judgements  decreased  as  psychological  distances  between 
items  increased.  French  (1958)  suggests  that  preference  schedules  be 
rescaled  when  attitudinal  norms  of  groups  differ  greatly  from  the  standard. 

Effects  of  Demographic  Characteristics  on  Responses 

The  final  general  area  contributing  to  response  bias  is  the  effect  of 
demographic  characteristics.  Such  identifications  as  sex,  age,  race,  or 
education  have  been  examined  to  determine  if  similarities  of  such  variables 
among respondents tend to be related to a response pattern. Roslow and
Blankenship  (1939)  pointed  out  the  theoretical  need  for  designing  questions 
with  the  respondent's  background  in  mind.  Johnson  (1958)  found  that  in 
personnel selection, the harmony of demographic characteristics played a
major  role  in  interviewer-interviewee  relations.  Schaie  (1962)  noted  that 
factor  matching  in  analysis  depended  on  demographic  knowledge  of  raters 
and  responders.  Jury  (1971)  determined  that  demographic  characteristics 
reflected  differences  in  workers'  views  of  organizational-type  variables, 
but that they were not related to individual-type variables.

Socioeconomic class has been identified by Soueif (1958) and by
Clancy and Garsen (1970) as a source of bias in response patterns.
Sgan (1967) found middle class children to be more susceptible to experimenters'
influence than lower class children. Race was found to be an
identifying factor in extreme response rates by Sherif and Hovland (1953),
while Sattler (1970) noted that response bias increased in interviews as
racial disparity grew. Another characteristic, level of education,
has  been  shown  to  relate  to  a decrease  in  acquiescent  response  style 
(Falthzik  and  Jolson,  1974),  and  to  an  increase  in  nonacceptance  of  causal 
explanations  in  ambiguous  situations  (Nunnally  & Husek,  1958). 

Several  authors  have  identified  other  demographic  characteristic 
variables  such  as  age,  religion,  intelligence,  sex,  marital  status,  parent- 
hood, nationality,  urban  or  rural  residence,  income,  rank,  and  experience. 
Such variables have been correlated with biases found in response consistency
(Hart, Faust, Rowland & Lucier, 1964; Dakin & Tennant, 1968; Goldsamt,
1972;  Flyer  & Carp,  1962;  and  Sicinski,  1970).  Aschal  (1958)  and  Wells 
(1963)  found  correlations  between  acquiescence  and  demographic  variables. 
Quinn (1967), in a study contrasting with previous research, found no relationship
between several demographic characteristics of raters and ratees and
their  ratings.  Bauer  (1947),  Ferber  (1966),  and  Ognibene  (1973)  found 
common  characteristics  of  youth  and  less  education  among  nonrespondents  in 
mail  surveys. 

Other  studies  have  explored  more  removed  characteristics  searching 
for  significant  differences  in  responses.  Bayroff,  Haggerty,  and  Rundquist 
(1954) examined "hard raters" and "easy raters" but found no differences
in validity. High and low "feeling" persons were found by Frisbie and
Sudman (1968) to make long answers on open-ended questionnaires. Ferber
(1956) suggested that survey researchers determine the state of knowledge
of their sample to avoid a response bias by persons ignorant of the issues
and by persons misinformed about the issues. Two studies of military
personnel (Hollis, 1954; Gilbert, 1956) indicated that the influence of
occupational environment may be related to a bias against criticism and
for acquiescence.


Summary  and  Conclusions 

Response bias is an error factor in questionnaire technology due to
a pattern of answers made by the respondent that appears to be related to
extraneous variables. Several areas of origin for response bias have been
studied; they are grouped in this chapter into six categories.

1.  Format  biases  are  responses  influenced  by  the  question  stem  or  re- 
sponse alternatives.  Sequence  and  fixed  choice  responses  have 
been  related  to  this  bias. 

2.  Social  desirability  has  been  well  identified  as  a response  set 
where  persons  answer  according  to  the  norms  they  believe  society 
condones.  The  faking  of  responses  on  questionnaires  contaminates 
the  results,  and  controls  must  be  designed  to  prevent  its  operation. 

3.  Acquiescence  is  the  bias  demonstrated  by  yeasayers  who  tend  to  re- 
spond more  often  agreeably  than  disagreeably.  Some  dispute  remains 
over  this  bias  as  to  whether  it  is  actually  a personality  trait. 

4.  The extreme response set refers to the pattern of answers persons
make which tend to be unevenly distributed toward one or both poles.
As with acquiescence, some research indicated that this response
style may also be a personality description.

5.  Attitudes may influence responses in identifiable patterns. Opinions
and beliefs seem to be related to a response bias.

6.  Demographic characteristics have been shown to be related to response
bias. Education, age, social class, etc., have been found
influential in a response pattern especially noted by consistency.

Research  during  the  last  twenty-five  years  establishes  a very  strong 
case  for  the  existence  of  response  bias.  Studies  documenting  its  origins 
in  social  desirability,  questionnaire  format,  and  demographic  character- 
istics are  numerous.  More  evidence  is  needed  to  confirm  that  acquiescence, 
extreme  response  set,  and  attitudes  are  actually  biases  and  not  personality 
traits.  None  of  the  control  measures  examined  thus  far,  including  changing 
wording  direction,  balancing  scales,  using  card  sorts,  forced  choice,  or 
open-ended designs, or loading questions, have convincingly eliminated response
bias. More detailed identification and control methods are areas
of  needed  further  research  in  response  bias. 

Chapter XI


CONSIDERATIONS RELATED TO THE EVALUATION
OF QUESTIONNAIRE RESULTS


Considerations related to the evaluation of questionnaire results
were another area not stressed during the literature review. Some articles
were  reviewed,  however,  that  pertain  to  the  scoring  of  questionnaire 
results,  the  properties  and  uses  of  ipsative  scores,  and  data  analyses. 


Scoring  of  Questionnaire  Results 

Practical considerations. Erdos (1948b) noted a point that is
often forgotten until too late: both time and money can be saved by
planning the questionnaire in line with tabulation requirements. He used
sample  questions  to  illustrate  the  relationship  between  order  of  questions 
and  tabulation,  and  how  phrasing  of  questions,  sequence,  and  layout  can 
affect  tabulation  time.  He  also  pointed  out  that  whether  data  are  to  be 
tabulated  by  hand  or  by  machine  is  an  important  decision  and  should  be 
made  in  advance.  The  precoding  of  responses  whenever  possible  was  also 
recommended. 

Quite  early,  Bass  and  Wurster  (1956)  described  the  use  of  IBM  Mark  Sense 
cards  to  put  data  on  punched  cards.  They  noted  the  procedure  avoids  the 
expense  and  difficulty  of  coding  and  keypunching  large  volumes  of  raw  data. 
Of  course,  the  use  of  Mark  Sense  cards  has  been  largely  replaced  by  one 
of a number of optical scanning procedures allowing the processing of
regular  sized  answer  sheets  and  booklets. 

Lyman  (1949)  examined  the  assumption  that  items  in  a multi-scale 
inventory  should  be  scrambled,  even  when  the  items  are  "obvious."  He 
compared scrambled items and items blocked according to scale in a school
attitude survey. Two high school senior classes were given the tests,
one half of each taking one version, followed two
weeks later by the other version. Test scores revealed no statistically
significant differences, leading the author to conclude that blocking
items  may  be  preferable  due  to  its  greater  ease  of  scoring. 

Other  considerations  related  to  scoring  questionnaires.  Methods  of 
scoring  questionnaires,  especially  attitude  scales,  were  discussed  by  a 
number of authors. For example, Kundu (19t0) suggested a method for
scoring responses on three point attitude scales. Assuming a non-normal
distribution  of  attitude  scores  and  non-neutral  trends  in  attitudes,  the 
"neutral"  responses  are  broken  into  positive  or  negative  and  the  responses 
are  scored  with  the  help  of  average  group  trends  and  weighted  scores  of 
the  individual  responses.  Peabody  reported  (1962)  that  there  is  a 
justification for scoring items dichotomously according to the direction
of response, as is done when bipolar scales are analyzed in terms of the
proportion of responses in each direction of the basic dichotomy. The
justification  is  based  upon  results  he  obtained  that  indicated  composite 
scores  reflect  primarily  the  direction  of  responses,  and  only  to  a minor 
extent  their  extremeness.  (He  also  noted  that  since  extremeness  scores 
are  reliable  and  are  largely  unreflected  in  the  usual  composite  score, 
they  may  have  quite  different  correlates  of  their  own.)  Matell  (1970), 
who  investigated  the  psychometric  characteristics  of  Likert-type  rating 
scales consisting of two through 19 steps, found that, by collapsing the
steps into two or three measurement categories for analysis, no lack of
precision resulted. Schuessler (1952), however, raised doubt about the
validity  of  combining  response  categories  in  successive  approximations  of 
scalability.  He  showed  irregularities  between  analyses  of  a questionnaire 
form  in  which  an  "uncertain"  response  was  permitted  and  combined  as  an 
approximation,  and  a second  questionnaire  form  in  which  the  "uncertain" 
response was not permitted. Odesky (1967), working with paired comparisons
with a no preference option, suggested the advisability of either
dividing  no  preference  responses  proportionate  to  preference  responses, 
or  disregarding  them  altogether.  The  basis  of  this  suggestion  was  that 
respondents  who  claim  neutrality  appear  to  exhibit  the  same  preference 
patterns  as  those  who  express  a preference. 
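
Odesky's first option, dividing no preference responses in proportion to the expressed preferences, can be sketched as follows. This is a minimal illustration only; the function name and the counts are hypothetical and do not come from the study.

```python
def allocate_no_preference(prefer_a, prefer_b, no_preference):
    """Split no-preference responses in proportion to expressed preferences."""
    expressed = prefer_a + prefer_b
    if expressed == 0:
        raise ValueError("no expressed preferences to allocate against")
    share_a = prefer_a / expressed
    return (prefer_a + no_preference * share_a,
            prefer_b + no_preference * (1 - share_a))


# Hypothetical counts for one pair: 60 prefer A, 20 prefer B, 20 no preference.
adjusted_a, adjusted_b = allocate_no_preference(60, 20, 20)
print(adjusted_a, adjusted_b)
```

Under this allocation the share preferring A is the same as when the no preference responses are disregarded altogether (60 of 80 in the hypothetical counts), so the two options differ only in the base on which the percentages are reported.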

Two  methods  of  scoring  rating  scale  data  to  approximate  forced 
choice  results  were  reported  by  Karr  (1959a;  1959b).  One  was  called  the 
difference  method,  and  was  designed  to  have  maximum  stability.  The  other, 
called  the  zero-one  method,  was  designed  to  match  as  closely  as  possible 
the  method  of  scoring  for  the  forced  choice  format  personality  inventory 
with  which  the  scoring  methods  were  being  compared.  Karr  concluded  that 
by  using  any  one  of  several  methods  of  scoring  or  transforming  self-rating 
scale  raw  scores,  it  is  possible  to  approximate  dyadic  forced  choice 
results with considerable saving in administration time, and a small gain
in  test-retest  reliability. 

It  was  hypothesized  by  Schaie  (1963)  that  the  concurrent  validity  of 
questionnaires  can  be  increased  by  the  use  of  item  weights  obtained  by 
expert scaling, instead of by using conventional unit weights. The results,
using  a high  school  personality  quiz,  showed  only  low  magnitude  increments 
in  validity,  however. 


Several authors reported on the use of intensity scores as distinguished
from content scores. Guttman (1947a) showed how intensity scores can be
obtained by either the fold-over technique or the two-part technique. The
fold-over technique involves weighting extreme responses (positive and
negative) as 2, moderate responses as 1, and neutral responses as 0, and
summing these for an intensity score. The two-part technique ascertains
an intensity score simply by following each question with the query "How
strongly do you feel about this?" which is also answered on a scale.
Goldsamt (1972), working with content-free stimuli, however, concluded
that dichotomous scoring methods are equivalent to intensity scoring methods.
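
The fold-over weighting can be illustrated with a short sketch. It assumes five-point items coded from -2 (extreme negative) through 0 (neutral) to +2 (extreme positive); that coding and the function names are assumptions made for the example, not part of Guttman's description.

```python
def content_score(responses):
    """Conventional signed sum: direction and extremeness combined."""
    return sum(responses)


def foldover_intensity(responses):
    """Fold the scale over at neutral: extreme = 2, moderate = 1, neutral = 0."""
    return sum(abs(r) for r in responses)


# Hypothetical respondent: two extreme, two moderate, one neutral answer.
answers = [2, -2, 1, 0, -1]
print(content_score(answers), foldover_intensity(answers))
```

For this hypothetical respondent the content score is 0 while the intensity score is 6, which shows how strong but opposed feelings cancel in the content score yet remain visible in the intensity score.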

A rating-scoring technique for evaluating free response answers
was  described  and  illustrated  by  Canter  (1953).  The  technique  involved 
using  raters  to  sort  responses  to  each  of  four  questions  into  a seven 
category  forced  normal  distribution.  The  score  for  each  respondent  was 
the  sum  of  the  category  numbers  assigned  by  the  raters  over  the  four 
questions.  Interrater  reliabilities  of  .85  and  .88  were  reported. 

A procedure for correcting the influence of social desirability
(SD) response set in opinion research was reported by Smith (1967).
Several SD items, to which it is assumed the true response is known
(e.g.,  "Do  you  like  everyone  you  meet?")  are  included  in  the  questionnaire. 
The  SD  score  from  these  items  is  then  correlated  with  each  of  the  other 
items  on  the  inventory.  The  responses  on  those  items  with  a statistically 
significant  correlation  can  then  be  corrected,  by  moving  the  response  one 
or  more  steps  from  the  socially  desirable  response,  to  give  a more 
accurate  result. 
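
The screening step of this procedure, correlating the SD score with every other item, might be sketched as below. The data, item names, and function names are hypothetical, and a fixed cutoff on the absolute correlation stands in for the significance test, which the summary above does not specify. The correction itself is left out because the number of steps to move a response is not given.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation information
    return sxy / sqrt(sxx * syy)


def flag_sd_items(sd_scores, item_responses, cutoff):
    """Return names of items whose correlation with the SD score meets the cutoff."""
    return [name for name, responses in item_responses.items()
            if abs(pearson(sd_scores, responses)) >= cutoff]


# Hypothetical data: six respondents, an SD score and two opinion items each.
sd_scores = [0, 1, 2, 3, 4, 5]
items = {"item_a": [1, 2, 2, 4, 4, 5], "item_b": [3, 1, 4, 2, 3, 2]}
print(flag_sd_items(sd_scores, items, cutoff=0.8))
```

In practice the cutoff would be replaced by whatever significance criterion the investigator adopts for the sample size at hand.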


Properties and Uses of Ipsative Scores

During the literature review, some attention (though not an exhaustive
review) was given to the topic of ipsative scores, since their properties
are not well known but should be understood by those who design
questionnaires. Cattell (1944) noted that psychological
measurement could be expressed in three kinds of units: interactive,
normative,  and  ipsative.  Interactive  units  are  exemplified  by  the  typical 
"raw"  units  of  psychology,  where  one  is  measuring  the  interaction  of  the 
individual  and  his  environment.  Interactive  units  are  neither  dependent 
upon  any  other  scores  of  the  individual  measured  nor  upon  the  scores  of 
other  individuals.  Normative  units  are  interactive  measurements  relative 
to  a group  of  persons,  or  in  terms  of  a population  of  measurements  provided 
by  a population  of  persons.  Hence,  the  score  of  any  given  individual  is 
dependent  upon  the  scores  of  others  in  the  population.  Most  scores  in 
behavioral  measurement  are  of  this  type.  Ipsative  units,  on  the  other  hand, 
are  interactive  measurements  in  terms  of  a population  of  measurements 
within  an  individual.  Hence,  the  score  for  an  individual  on  a variable 
is dependent upon his scores on other variables.

Several derivations of the three forms given above are possible.
Where interactive measures are first scored normatively and then scored
ipsatively, "normative ipsative" units are produced. "Ipsative normative"
scores are also obtainable when ipsative units are themselves treated
normatively. As Cattell (1944) pointed out, however, ipsative normative
scores are not the same as normative ipsative scores.

Ipsative scores can be obtained in one of two ways: arithmetically,
through the use of various scaling procedures; and experimentally.
Experimentally ipsative scores are produced, for example, by the forced choice
technique, the use of paired comparisons, and the Q-sort. Examples of
well-known tests that produce ipsative scores are the Allport-Vernon Study of
Values, the Edwards Personal Preference Schedule, and the Kuder Preference
Record. A set of attribute measures is defined as ipsative when the sum
of scores over all attributes is a constant for each entity.
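
One common way to ipsatize arithmetically is to subtract each person's own mean from his scores, so that every person's scores sum to the same constant (zero). The sketch below assumes that procedure; the function name and the raw scores are hypothetical.

```python
def ipsatize(scores):
    """Express one person's scores as deviations from that person's own mean."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Hypothetical raw scores of two persons on three attributes.
raw = {"person_1": [9, 8, 7], "person_2": [3, 2, 4]}
ipsative = {person: ipsatize(scores) for person, scores in raw.items()}

for person, scores in ipsative.items():
    print(person, scores, "sum =", sum(scores))
```

On the third attribute person_2 receives the higher ipsative score (+1 against -1) although his raw score is lower (4 against 7), because each score is relative only to the same person's other scores.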

The  properties  of  ipsative  measures  were  investigated  by  Clemans 
in  1956  and  reported  in  a Psychometric  Monograph  in  1965.  Some  of  his 
major findings, based on arithmetically ipsatized scores, and related
implications  and  recommendations  are  given  in  detail  below  since,  as 
previously  noted,  the  properties  of  ipsative  scores  are  still  not  well 
known. 

1.  There  is  always  a set  of  raw  or  absolute  measures  underlying  an 
ipsative  set.  They  may  be  very  difficult  or  impossible  to  obtain,  but  in 
theory  they  are  there. 

2.  There  will  be  a large  number  of  negative  values  in  any  ipsative 
correlation  matrix. 

3.  Ipsative  intercorrelation  matrices  are  nonbasic  or  singular  and 
thus  have  no  regular  inverse.  Hence,  if  regression  weights  are  to  be 
determined  for  a complete  set  of  ipsative  variables,  special  procedures 
(such as iterative techniques) will have to be employed or one of the
ipsative  variables  will  have  to  be  deleted. 

4.  The  least-square  estimate  of  a criterion  using  all  the  variables 
of  an  ipsative  set  is  identical  with  the  least-square  solution  with  any 
single  variable  deleted,  regardless  of  the  validity  coefficient  of  that 
variable. 

5.  Ipsative  scores  must  always  be  interpreted  as  relative  and  not 
absolute  measures.  Ipsative  variables  are  highly  interdependent. 

6.  Except  in  one  special  case,  the  ipsative  multiple  correlation  is 
always  less  than  the  multiple  correlation  for  the  same  variables  prior  to 
ipsatizing.  Whenever  possible  then,  absolute  measures  should  be  used  for 
measuring  attributes  of  behavior. 

7.  If  the  underlying  absolute  measures  have  zero  correlation  with 
each  other,  or  they  all  correlate  to  some  constant  degree,  the  ipsative 
intercorrelations  will  all  be  a negative  constant  value  determined  only 
by  the  number  of  variables.  This  again  shows  the  high  interdependence. 

8.  Under  certain  restrictions,  the  ipsative  covariance  matrix  and 
the  first  centroid  residual  of  the  absolute  measures  are  identical,  and 
the  property  seems  to  hold  very  well  even  without  the  restrictions.  This 
is  the  same  as  stating  that  a tremendous  amount  of  information  is  missing 
in  the  ipsative  set.  The  fact  that  this  information  is  missing  from  the 
ipsative set will make it next to impossible to make anything psychologically
meaningful out of a factor analysis of such data, other than a
determination  of  the  rank  of  the  matrix. 


9.  If  the  means  and  variances  of  a set  of  scores  were  not  equated 
prior to ipsatizing, the resulting scores will have little meaning. Although
the  means  can  be  adjusted  after  ipsatizing,  the  variances  cannot.  Hence, 
the  ipsative  test  maker  must  be  as  certain  as  he  can  that  equal  variances 
are  maintained  for  the  absolute  scales  underlying  an  ipsative  set  even 
though  the  absolute  measures  cannot  be  observed.  How  this  is  to  be  accom- 
plished is  unknown. 

10.  The  magnitude  of  ipsative  scores  must  never  be  confused  with  the 
magnitude of the absolute measures for the same set of variables. It is
quite  possible  that  a person  with  a low  ipsative  score  on  a particular 
trait  may  actually  possess  more  of  the  trait  than  a person  having  a much 
higher  ipsative  score. 

11.  Although  nonipsative  measures  contain  more  information  than 
ipsative  measures,  it  is  not  usually  an  easy  task  to  develop  absolute 
measures  that  correspond  to  the  variables  in  an  ipsative  set.  It  was  the 
difficulty of obtaining valid absolute measures that led to the development
of some of the available ipsative instruments, for example in attempts to
eliminate  the  social  desirability  factor.  Hence,  some  traits  that  may  be 
relatively  easily  compared  using  ipsative  techniques  may  be  very  difficult 
to assess validly using instruments designed to yield more direct or
absolute  measures. 
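
The negative constant in finding 7 can be worked out directly under simplifying assumptions that are not stated in the summary above: the m absolute measures x_1, ..., x_m have a common variance sigma^2 and a common intercorrelation rho (rho = 0 covers the uncorrelated case), and the scores are ipsatized by subtracting each person's mean. The symbols are introduced here for illustration only.

```latex
\begin{align*}
y_i &= x_i - \bar{x}, \qquad \bar{x} = \frac{1}{m}\sum_{k=1}^{m} x_k \\
\operatorname{Cov}(x_i,\bar{x}) &= \operatorname{Var}(\bar{x})
   = \frac{\sigma^2\,[\,1 + (m-1)\rho\,]}{m} \\
\operatorname{Cov}(y_i,y_j) &= \rho\sigma^2 - \frac{\sigma^2\,[\,1 + (m-1)\rho\,]}{m}
   = -\frac{\sigma^2 (1-\rho)}{m}, \qquad i \neq j \\
\operatorname{Var}(y_i) &= \sigma^2 - \frac{\sigma^2\,[\,1 + (m-1)\rho\,]}{m}
   = \frac{\sigma^2 (m-1)(1-\rho)}{m} \\
r_{y_i y_j} &= \frac{\operatorname{Cov}(y_i,y_j)}{\operatorname{Var}(y_i)}
   = -\frac{1}{m-1}, \qquad \rho < 1
\end{align*}
```

Both sigma^2 and rho cancel, so under these assumptions the ipsative intercorrelation depends only on the number of variables; with m = 5 variables, for example, every intercorrelation would be -.25.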

Seven  other  studies  were  reviewed  that  also  investigated  the  properties 
and  uses  of  ipsative  scores.  Block  (1957)  reported  the  results  of  two 
studies  in  which  ipsative  ratings,  treated  normatively,  were  correlated 
with  corresponding  normative  ratings  in  a test  of  the  functional  equivalence 
of  the  two  forms  of  measurement.  Both  of  the  analyses  showed  an  almost 
complete  equivalence  between  the  two  methods.  Some  of  the  advantages  of 
the  ipsative  approach  were  also  presented. 

Horst  and  Wright  (1959)  reported  on  the  comparative  reliability  of 
an arithmetically ipsatized rating scale and its experimentally ipsative
counterpart.  The  experimentally  ipsative  scores  were  obtained  from  the 
forced  choice  format  Edwards  Personal  Preference  Schedule,  the  individual 
items  of  which  were  also  administered  in  rating  scale  form  to  obtain  inter- 
active scores.  The  interactive  scores  were  standardized  by  variable  to 
produce  normative  scores,  and  these  were  then  arithmetically  ipsatized  over 
persons  to  produce  normative  ipsative  measures.  It  was  found  that  the 
average  reliability  of  the  variables  for  the  arithmetically  ipsatized 
rating  scale  form  was  .87,  while  for  the  experimentally  ipsative  scale 
it  was  .78,  even  though  the  administration  time  for  the  rating  scale  was 
only about one-third that of the EPPS. They concluded that any advantage
which  the  forced  choice  type  of  self-appraisal  instrument  may  have  over  the 
arithmetically  ipsatized  rating  scale  must  be  other  than  that  of  greater 
reliability.  They  also  suggested  that  other  possible  advantages  be 
investigated  in  further  research.  In  a related  report,  Wright  (1961) 
compared  the  three  types  of  measures  with  respect  to  their  intercorrelations 
and factor structures. The data suggested that the normative approach
provided  more  significant  measures  since  one  less  factor  was  required  to 
extract  all  of  the  approximate  reliable  variance  from  the  intercorrelations 
of  the  experimentally  ipsative  units  than  from  the  normative  and  arith- 
metically ipsative  scores.  An  inspection  of  the  unrotated  factor  patterns 
showed  that  the  first  normative  units  factor  was  not  adequately  matched  by 
either  a normative  ipsative  or  an  experimentally  ipsative  factor.  It  was 
tentatively  identified  as  a factor  of  social  desirability,  the  attempted 
minimization  of  which  was  the  reason  the  EPPS  was  developed  in  forced 
choice  format. 

The equivalence of ipsative and normative personality measures was also
studied by Heilbrun (1963) with regard to interscale correlation and relative
validities. Using normative check lists and forced choice ipsative Q-sorts,
the  results  were  interpreted  as  supporting  the  use  of  ipsative  measures  for 
normative  predictions.  Concerning  reliability,  Tenopyr  (1968)  noted  that 
the  practice  of  resorting  to  stability  coefficients  as  reliability  estimates 
for  ipsative  scores  is  not  a satisfactory  method  by  itself  since  these 
coefficients are subject to scale interdependency. Tenopyr suggested that the
recommended  practice  for  establishing  the  reliability  of  ipsative  inventory 
scales  should  involve  the  establishment  of  internal  consistency  and  stability 
for  the  scales  prior  to  putting  them  into  the  forced  choice  form. 

Two  reports  mentioned  the  "degree  of  ipsativity."  Smith  (undated,  but 
around  1965)  reviewed  the  relevant  literature  describing  the  mathematical 
and  empirical  properties  of  ipsative  and  nonipsative  measures.  The  review 
led  to  the  explication  of  a simple  procedure  for  quantifying  the  "degree 
of  ipsativity"  in  psychological  measurement  instruments.  After  evaluating 
several  published  research  studies  against  the  index,  he  concluded  that 
purely ipsative test instruments possess such extensive psychometric and
statistical  limitations  that  utilization  of  such  instruments  is  not  advisable. 
Hicks  (1970)  came  to  the  same  conclusion  in  what  could  be  a later  report 
on  the  same  study.  He  went  on  to  suggest,  however,  that  ipsative  tests 
should  be  used  only  in  situations  where  it  has  been  demonstrated  that: 
significant  response  bias  exists;  this  bias  reduces  validity;  and  an 
ipsative format successfully reduces bias and increases validity to a
greater  extent  than  do  nonipsative  controls  for  bias.  Since  Hicks  felt 
that  little  of  the  research  utilizing  ipsative  measures  fulfilled  these 
requirements,  he  believed  that  it  is  necessary  to  reevaluate  thoroughly 
the extensive body of research that has used purely ipsative forced choice
tests and that has employed statistical techniques predicated upon assumptions
which such instruments necessarily violate. It may be noted that
the  conclusions  of  both  Smith  and  Hicks  are  somewhat  more  extreme  than  those 
of  Clemans  (1965). 


Data  Analyses 


Generally, reports on data analyses were beyond the scope of this
review unless specifically connected with some aspect of questionnaire
construction. Most that were located were discussed above in other
chapters in conjunction with their related topic. Four articles, however,
may be noted here.


Stevens (1946) pointed out four kinds of measurement scales:
nominal,  ordinal,  interval,  and  ratio.  Appropriate  statistical  analyses 
are  associated  with  each.  Hence,  the  data  analysis  limitations  of  various 
forms  of  questionnaires  should  be  considered  before  an  instrument  is 
designed. For example, from a power-of-the-statistic point of view, less
can  be  done  with  open-ended  questions  than  with  ranking  questions. 
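
The link between scale type and appropriate statistics can be written down as a small lookup. The statistics listed are a partial selection from those usually associated with Stevens' four scale types, and the assumption that each scale also admits the statistics of the weaker scales follows the same classification; the names in the sketch are hypothetical.

```python
# Scale types ordered from weakest to strongest.
SCALE_ORDER = ["nominal", "ordinal", "interval", "ratio"]

# Statistics that first become appropriate at each scale type (partial list).
ADDED_STATISTICS = {
    "nominal": ["mode"],
    "ordinal": ["median", "percentiles"],
    "interval": ["mean", "standard deviation"],
    "ratio": ["coefficient of variation"],
}


def permissible_statistics(scale):
    """List statistics appropriate for a scale, including those of weaker scales."""
    level = SCALE_ORDER.index(scale)
    statistics = []
    for name in SCALE_ORDER[:level + 1]:
        statistics.extend(ADDED_STATISTICS[name])
    return statistics


print(permissible_statistics("nominal"))
print(permissible_statistics("ordinal"))
```

If, for illustration, coded open-ended answers are treated as nominal data and rankings as ordinal data, the second list is the longer one, which is the sense in which less can be done with the former.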


A statistical  measure  most  appropriately  used  in  conjunction  with 
the  method  of  paired  comparisons  was  reported  by  Balinsky,  Blum,  and 
Dutka  (1951).  Called  the  coefficient  of  agreement,  it  enables  the 
experimenter to measure the degree and test the significance of agreement
among  observers  as  to  their  preferences  for  a series  of  items  offered  for 
consideration.  It  can  be  readily  used  in  the  construction  and  testing 
of  attitude  and  opinion  scales. 
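
The formula for the coefficient is not given in this review. The sketch below assumes it is the coefficient of agreement for paired comparisons usually attributed to Kendall and Babington Smith, which counts the pairs of judges who agree on each comparison; the function name and data layout are hypothetical, and the significance test is not shown.

```python
from math import comb


def coefficient_of_agreement(preference_counts, judges):
    """Agreement among judges over all paired comparisons of n items.

    preference_counts[i][j] is the number of judges preferring item i to
    item j (diagonal entries are ignored); judges is the number of judges.
    """
    n = len(preference_counts)
    agreeing_pairs = sum(comb(preference_counts[i][j], 2)
                         for i in range(n) for j in range(n) if i != j)
    return 2 * agreeing_pairs / (comb(judges, 2) * comb(n, 2)) - 1


# Hypothetical data: four judges, three items.
unanimous = [[0, 4, 4], [0, 0, 4], [0, 0, 0]]      # all judges rank A > B > C
evenly_split = [[0, 2, 2], [2, 0, 2], [2, 2, 0]]   # every comparison splits 2-2
print(coefficient_of_agreement(unanimous, judges=4))
print(coefficient_of_agreement(evenly_split, judges=4))
```

Complete agreement gives a value of 1, and an even split on every comparison gives the minimum, which for an even number of judges m is -1/(m - 1).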


Litwak (1956) points out that ad hoc rules on question wording can
be  systematically  defined  by  the  constraints  of  latent  structure  analysis. 
And  Reynolds  (1966)  attempts  to  determine  the  degree  of  difference  between 
two  ratings  required  for  statistical  significance  with  samples  of  varying 
sizes. 


Chapter XII

RECOMMENDED AREAS FOR FURTHER RESEARCH


This  chapter  contains  recommendations  for  further  research  based 
upon  a lack  of  empirical  research  or  contradictions  in  results  of  the 
studies reviewed in the previous chapters. The section headings used
correspond  to  the  previous  chapters. 


Advantages  and  Disadvantages  of  Various  Types  of  Questionnaires 

1.  Because of the lack of stress in this review on mail questionnaires,
only a few articles were discussed in Chapter II. Additional
information could probably be found by extending the
literature  search. 

2.  More  research  appears  to  be  needed  on  the  benefits,  validity,  and 
reliability  of  combinations  of  questionnaire  methods,  for  example 
interview  and  self-administered  questionnaires. 

Selection  of  Questionnaire  Items  to  Be  Used 

1.  More  research  appears  to  be  needed  on  the  comparison  of  ranking  and 
rating techniques. For example, there is some evidence that conclusions
based upon a single judge differ from those based upon
multiple  judges.  Also,  more  studies  need  to  be  designed  where  the 
items  to  be  ranked  or  rated  are  as  comparable  as  possible. 

2.  Contradictory evidence was obtained regarding the comparison of
ranking  and  paired  comparison,  which  suggests  further  research. 

3.  More  studies  need  to  be  conducted  on  the  comparison  of  rating 
scales  and  forced  choice  items,  where  identical  items  are  used 
in  both  forms. 

4.  Since  few  studies  were  located  on  the  comparison  of  rating  scales 
and  card  sorts,  rating  scales  and  semantic  differential  items,  and 
rating  scales  and  check  lists,  more  studies  can  be  carried  out  in 
these  areas. 

5.  More  research  is  needed  on  the  comparison  of  multiple  choice  items 
with  other  item  types. 

6. A more critical and detailed review is needed regarding issues
related to forced choice and paired comparison items.


7. A more extensive literature review regarding the use of the semantic
differential might be in order.

8.  Because  of  the  focus  of  this  study, few  articles  were  located  on 
card  sorts  and  projective  items,  suggesting  that  a more  complete 
literature  review  could  be  conducted. 

9.  Few  studies  concerning  check  lists,  open-ended  items,  rearrangement 
items,  and  matching  items  were  uncovered,  suggesting  another  possi- 
ble area  of  additional  research. 

Comparison  of  Scaling  Techniques 

1. This review did not stress scaling techniques, on which many articles
have been written. It is suggested that a review of the literature
be done that focuses solely on scaling techniques as they relate to
questionnaire construction.

Effects  of  Variation  in  Presentation  of  Questionnaire  Items 

1. Even though the review of the literature revealed that pictures
can be effectively employed in questionnaires, the review should
be extended to determine other modes of item presentation that can
be used in questionnaires (e.g., use of tape recorders or physical
objects).

2.  Follow-up  research  is  warranted  in  the  area  of  question  stem  word- 
ing. Many  important  issues  have  been  raised  by  the  studies  presen- 
ted, but  there  has  been  little  systematic  pursuit  of  the  issues  to 
a conclusion. 

3.  Since  no  research  studies  were  uncovered  which  examined  the  wording 
of  response  alternatives,  research  needs  to  be  done  in  this  area. 

4.  More  attention  has  been  devoted  to  measures  of  item  difficulty  than 
to  the  effects  of  item  difficulty  on  questionnaire  responses. 
Additional attention needs to be focused on how item difficulty affects
response tendencies such as acquiescence, "don't know," and "no response."

5. The effect of the length of a question stem is an under-researched
area. Studies should be conducted to the point where generalized
conclusions can be made.

6.  Experiments  controlled  for  subjects'  characteristics,  topical  area, 
scale  length  and  instrument  should  be  conducted  to  determine  the 
effects  of  the  order  of  response  alternatives. 

7.  No  research  was  uncovered  relative  to  adjective  location  in  the  stem 
of  the  question  versus  adjective  location  in  the  response  alterna- 
tive. Such  research  needs  to  be  done. 


Number of Response Alternatives and Response Anchoring


1.  Additional  research  in  the  area  of  the  optimal  number  of  response 
alternatives  to  use  is  warranted.  This  research  should  cover: 
the  different  types  of  rating  scales;  various  topical  areas  of 
research;  and  subjects  with  different  ability,  educational  and 
sociodemographic characteristics. From such research, information
would be available regarding the optimal number of response
alternatives to employ for any specific type of investigation.

2.  Additional  work  needs  to  be  done  on  the  use  of  balanced  versus  un- 
balanced scales. 

Order of Perceived Favorableness of Commonly Used Words and Phrases

1.  Even  though  extensive  work  has  been  done  on  the  order  of  perceived 
favorableness  of  commonly  used  words  and  phrases,  individual  in- 
vestigators may  want  to  determine  the  order  of  perceived  favorable- 
ness of  words  which  are  not  included  in  the  lists  in  Chapter  VII 
and  which  are  commonly  used. 

Considerations  Related  to  the  Physical  Characteristics  of  Questionnaires 

1.  Research  needs  to  be  conducted  in  regard  to  the  location  of  response 
alternatives  relative  to  the  question  stem. 

2. Additional studies need to be carried out to determine the effect
that the length of a questionnaire has on both the respondents'
motivation and on the return of mailed questionnaires.

3. Another possible area of research would be to determine the relations
of questionnaire length to response consistency and validity.

4.  Systematic  research  needs  to  be  done  on  the  physical  appearance  of 
questionnaires  including  type  size,  spacing,  color,  type  of  paper 
and  the  use  of  pictures. 

Considerations  Related  to  the  Administration  of  Questionnaires 

1. More systematic research is needed to determine the range of variations
in instructions that may affect the results obtained from
questionnaires and on the effects of variations in respondent
understanding of instructions.

2. Further research is needed on the effects of administration time on
subjects' motivation, and on the effects of setting time limits
for completing questionnaires.

3. No studies were uncovered that were concerned with the effects of
the administrators of questionnaires in the military setting.
For example, the military rank of the person administering a
questionnaire may have an effect, as might whether the administrator
is in the military or not.

4. It is apparent that additional research is needed on the effects
of administrative conditions. Such research should include the
study of fatigue factors.

Characteristics  of  Respondents  that  Influence  Questionnaire  Results 

1. Controls for all types of response bias need research.

2. The question "Are attitudinal bias and characteristics bias
automatically eliminated with stringent sampling controls, or must
each instrument take them into account?" needs to be resolved.

Considerations  Related  to  the  Evaluation  of  Questionnaire  Results 

1.  A more  extensive  review  should  be  made  of  work  related  to  the  prop- 
erties and  uses  of  ipsative  scores  and  research  should  be  under- 
taken to  fill  the  gaps  since  procedures  and  techniques  producing 
such  scores  are  in  wide  use. 

2.  The  literature  review  could  be  expanded  to  include  scoring  and 
data  analysis,  as  related  to  questionnaire  construction. 

General  Recommendations 


1.  The  literature  review  could  be  expanded  to  cover  citations  that 
were  not  abstracted. 

2.  The  present  bibliography  could  be  refined,  maintained,  and  updated. 
Possibly  it  could  be  computerized  so  that  requests  for  needed  in- 
formation could  quickly  be  answered. 

3.  The  present  literature  review  could  be  reviewed  by  senior  consul- 
tants in  the  field  and  expanded  or  modified  on  the  basis  of  their 
suggestions.

4.  Many  conclusions  presented  in  this  review  could  be  tested  in  relation 
to  the  military  situation. 

5.  An  attempt  could  be  made  to  collect  data  about  relevant  issues  on 
questionnaire  construction  from  groups  who  routinely  administer 
questionnaires  but  who  might  not  publish  their  findings. 